

A Min Tjoa · Li Da Xu
Maria Raffai · Niina Maarit Novak (Eds.)
Research and Practical
Issues of Enterprise
Information Systems
10th IFIP WG 8.9 Working Conference, CONFENIS 2016
Vienna, Austria, December 13–14, 2016
Lecture Notes
in Business Information Processing
Series Editors
Wil M.P. van der Aalst
Eindhoven Technical University, Eindhoven, The Netherlands
John Mylopoulos
University of Trento, Trento, Italy
Michael Rosemann
Queensland University of Technology, Brisbane, QLD, Australia
Michael J. Shaw
University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski
Microsoft Research, Redmond, WA, USA
More information about this series at http://www.springer.com/series/7911
A Min Tjoa
Vienna University of Technology
Maria Raffai
Szechenyi Istvan University
Li Da Xu
Old Dominion University
Norfolk, VA
Niina Maarit Novak
Vienna University of Technology
ISSN 1865-1348
ISSN 1865-1356 (electronic)
Lecture Notes in Business Information Processing
ISBN 978-3-319-49943-7
ISBN 978-3-319-49944-4 (eBook)
DOI 10.1007/978-3-319-49944-4
Library of Congress Control Number: 2016957480
© IFIP International Federation for Information Processing 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
International Conference on Research and Practical
Issues of Enterprise Information Systems
10th IFIP WG 8.9 Working Conference,
December 13–14, 2016
Austrian Computer Society, Vienna, Austria
This year’s CONFENIS 2016 marked the 10th anniversary of the IFIP WG 8.9
Working Conference on Research and Practical Issues of Enterprise Information Systems. It was a great pleasure that on this special occasion the conference returned to its
birthplace and was held at the Austrian Computer Society in downtown Vienna,
Austria, during December 13–14, 2016.
On the occasion of the anniversary we would like to take some time to look back.
The idea for a new conference series was born in Guimarães, Portugal, in 2005 at an
IFIP TC8 meeting (International Federation for Information Processing, Technical
Committee for Information Systems) with the aim of building a forum to deal with the
increasingly important area of enterprise information systems (EIS). In this particular
meeting, the committee members intensively discussed the initiative and proposal of
professors A Min Tjoa, Maria Raffai, and Li Da Xu, and agreed that enterprise
information systems is an important and dominant scientific subdiscipline. Hence, in
this meeting it was decided by the TC8 members that the community involved in this
area of information systems had to organize its first International Conference on
Research and Practical Issues of Enterprise Information Systems (CONFENIS 2006) in
order to prove its ability to be a leading organization for professionals in this area. This
conference was held in Vienna, Austria, in April 2006 with around 86 accepted papers
that were published by Springer in a proceedings book. This first conference was more
successful than expected, and thus a new Working Group, the WG 8.9, was established
in 2006 at the TC8 meeting in Santiago de Chile.
At that time a new conference series was born! It is worth reviewing the venues
of these conferences:
Vienna, Austria
Beijing, China
Milan, Italy
Gyor, Hungary
Natal, Brazil
Aalborg, Denmark
Ghent, Belgium
Prague, Czech Republic
Hanoi, Vietnam
Daejeon, South Korea
Vienna, Austria
On the occasion of the jubilee we would like to express our thanks to all Program
Committee members, conference and workshop organizers, and the colleagues who
reviewed the papers very thoroughly during this 10-year period. Looking back over the
past decade, we can state that, owing to the high number of submissions and the
quality of the submitted papers, the reviewing process was an extraordinarily challenging task. We are therefore deeply grateful to the many individual reviewers who
worked with us so diligently. Without their efforts, support, and helpfulness the
CONFENIS Series would have never been so successful.
The 2016 edition of the International Conference on Research and Practical Issues
of Enterprise Information Systems (CONFENIS 2016) focused mainly on aspects of
semantic concepts, open data, customer relationship management, security and privacy
issues, advanced manufacturing and management aspects, business intelligence as well
as decision support in EIS and EIS practices. This conference provided an international
forum for the broader IFIP community to discuss the latest research findings in the area
of EIS. The conference specifically aimed at facilitating the exchange of ideas and
advances on all aspects and developments of EIS.
CONFENIS 2016 received 63 high-quality papers from 15 countries on five continents. After a rigorous peer-reviewing process, a total of 24 papers were accepted. We
believe that the selected papers will trigger further EIS research and improvements. We
express our special thanks to the authors for their valuable work and to the Program
Committee members for their advice and support. At the same time, we would like to
acknowledge the great support by the Austrian Computer Society, the TU-WIEN, and
the organization team for their timely contribution and help, which made this edition
of the conference proceedings possible.
Finally, we believe that this 10-year jubilee conference organized again in Vienna
contributed toward innovative approaches to the various issues of EIS, and will continue to offer a platform for further discussions among researchers in the different EIS
areas in the future, too.
December 2016
A Min Tjoa
Maria Raffai
Li Da Xu
Niina Maarit Novak
General Chair
Li Da Xu
Old Dominion University, USA
A Min Tjoa
Vienna University of Technology, Austria
Program Committee
Stephan Aier, University of St. Gallen, Switzerland
Gabriele Anderst-Kotsis, Johannes Kepler University Linz, Austria
Amin Anjomshoaa, Massachusetts Institute of Technology (MIT), USA
Rogério Atem de Carvalho, Instituto Federal Fluminense, Brazil
Josef Basl, University of Economics Prague, Czech Republic
Larisa Bulysheva, Old Dominion University, USA
Sohail S. Chaudhry, Villanova University, USA
Ba Lam Do, Hanoi University of Science and Technology, Vietnam
Petr Doucek, University of Economics, Prague, Czech Republic
Frederik Gailly, Ghent University, Belgium
Jingzhi Guo, University of Macau, Macau, SAR China
Wu He, Old Dominion University, USA
Dimitris Karagiannis, University of Vienna, Austria
Nittaya Kerdprasop, Suranaree University of Technology, Thailand
Subodh Kesharwani, Indira Gandhi National Open University, India
Ismail Khalil, Johannes Kepler University, Austria
Elmar Kiesling, Vienna University of Technology, Austria
Ling Li, Old Dominion University, USA
Lisa Madlberger, Vienna University of Technology, Austria
Milos Maryska, University of Economics, Prague, Czech Republic
Young Moon, Syracuse University, USA
Charles Möller, Aalborg University, Denmark
Niina Maarit Novak, Vienna University of Technology, Austria
Alex GC Peng, University of Sheffield, UK
Jan Pries-Heje, Roskilde University, Denmark
Lene Pries-Heje, IT University of Copenhagen, Denmark
Maryam Rabiee, Columbia University, USA
Maria Raffai, Szechenyi University, Hungary
Aryan Peb Ruswono, Institut Teknologi Bandung, Indonesia
Lisa Seymour, University of Cape Town, South Africa
Christine Strauß, University of Vienna, Austria
Zhaohao Sun, PNG University of Technology, Papua New Guinea
A Min Tjoa, Vienna University of Technology, Austria
Tuan-Dat Trinh, Vienna University of Technology, Austria
Pan Wang, Wuhan University of Technology, China
Peter Wetz, Vienna University of Technology, Austria
Kefan Xie, Wuhan University of Technology, China
Li Da Xu, Old Dominion University, USA
Win Zaw, Yangon Technological University, Myanmar
Chris Zhang, University of Saskatchewan, Canada
Shang-Ming Zhou, Swansea University, UK
Organizing Chair
Niina Maarit Novak
Vienna University of Technology, Austria
Semantic Concepts and Open Data
Semantic Audit Application for Analyzing Business Processes
Ildikó Szabó and Katalin Ternai

Using Application Ontologies for the Automatic Generation of User Interfaces for Dialog-Based Applications
Michael Hitz and Thomas Kessel

Semantic-Based Recommendation Method for Sport News Aggregation System
Quang-Minh Nguyen, Thanh-Tam Nguyen, and Tuan-Dung Cao

Using SPEM to Analyze Open Data Publication Methods
Jan Kučera and Dušan Chlapek

OGDL4M Ontology: Analysis of EU Member States National PSI Law
Martynas Mockus

Customer Relationship Management

Social Media and Social CRM
Antonín Pavlíček and Petr Doucek

An Approach to Discovery of Customer Profiles
Ilona Pawełoszek and Jerzy Korczak

Security and Privacy Issues

Cyber Security Awareness and Its Impact on Employee’s Behavior
Ling Li, Li Xu, Wu He, Yong Chen, and Hong Chen

Lessons Learned from Honeypots - Statistical Analysis of Logins and Passwords
Pavol Sokol and Veronika Kopčová

Towards a General Information Security Management Assessment Framework to Compare Cyber-Security of Critical Infrastructure Organizations
Edward W.N. Bernroider, Sebastian Margiol, and Alfred Taudes

Advanced Manufacturing and Management Aspects

From Web Analytics to Product Analytics: The Internet of Things as a New Data Source for Enterprise Information Systems
Wilhelm Klat, Christian Stummer, and Reinhold Decker

Enterprise Information Systems and Technologies in Czech Companies from the Perspective of Trends in Industry 4.0
Josef Basl

Internet of Things Integration in Supply Chains – An Austrian Business Case of a Collaborative Closed-Loop Implementation
Andreas Mladenow, Niina Maarit Novak, and Christine Strauss

Application of the papiNet-Standard for the Logistics of Straw Biomass in Energy Production
Jussi Nikander

A Case-Base Approach to Workforces’ Satisfaction Assessment
Ana Fernandes, Henrique Vicente, Margarida Figueiredo, Nuno Maia, Goreti Marreiros, Mariana Neves, and José Neves

Effective Business Process Management Centres of Excellence
Vuvu Nqampoyi, Lisa F. Seymour, and David Sanka Laar

Business Intelligence and Big Data

Measuring the Success of Changes to Existing Business Intelligence Solutions to Improve Business Intelligence Reporting
Nedim Dedić and Clare Stanier

An Architecture for Data Warehousing in Big Data Environments
Bruno Martinho and Maribel Yasmina Santos

Decision Support in EIS

The Reference Model for Cost Allocation Optimization and Planning for Business Informatics Management
Petr Doucek, Milos Maryska, and Lea Nedomova

An Entropy Based Algorithm for Credit Scoring
Roberto Saia and Salvatore Carta

Visualizing IT Budget to Improve Stakeholder Communication in the Decision Making Process
Alexia Pacheco, Gustavo López, and Gabriela Marín-Raventós

Implementing an Event-Driven Enterprise Information Systems Architecture: Adoption Factors in the Example of a Micro Lending Case Study
Kavish Sookoo, Jean-Paul Van Belle, and Lisa Seymour

Software Innovation Dynamics in CMSs and Its Impact on Enterprise Information Systems Development
Andrzej M.J. Skulimowski and Inez Badecka

Optimization of Cloud-Based Applications Using Multi-site QoS Information
Hong Thai Tran and George Feuerlicht

Author Index
Semantic Concepts and Open Data
Semantic Audit Application for Analyzing
Business Processes
Ildikó Szabó(&) and Katalin Ternai
Department of Information Systems, Corvinus University of Budapest,
Fővám tér 13-15, Budapest 1093, Hungary
[email protected],
[email protected]
Abstract. Auditors use standard regulations to assess the compliance of business operations. This procedure is very time-consuming, and Computer Assisted Audit Tools lack the ability to process documents semantically in an automatic manner. This paper presents a semantic application which is capable of extracting business process models, in the shape of process ontologies, from business regulations, based on reference process ontologies transformed from process models derived from standard regulations. The application uses ontology matching to discover deviations of a given business operation and creates a transparent report for auditors. This semantic tool has been tested on one of the internationalization processes, Erasmus mobility.
Keywords: Audit · Business process management · Ontology · Ontology matching · Text mining
1 Introduction
Auditing information systems and business processes provides valuable feedback about business management. Constraints derived from regulations, guidelines, and standards define compliance requirements. In light of these, auditors have to investigate information reflecting business operations. This information can be extracted from transaction data stored in operational databases and data warehouses, or resides in internal regulations, handbooks, and event logs, waiting to be interpreted.
Computer Assisted Audit Tools and Techniques (CAATTs) support auditors with reports resulting from data analysis. Their main functions are to investigate the internal logic of an application directly, by testing transactional data produced by an IT application after feeding it with real or dummy data, or by executing parallel simulation. Moreover, there are indirect methods, such as Generalized Audit Systems (GAS) and embedded audit modules, for scrutinizing the compliance of an application. The latter focus on monitoring transactions within the application. GAS are used to extract and analyze data. Two GAS systems – ACL and IDEA – were widely used among participants of the study conducted by Braun and Davis [1].
The shortcoming of CAATTs is that they do not investigate the compliance of business processes directly and use only factual data provided by an information system. Evaluation based on processing business regulations is missing from them.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 3–15, 2016.
DOI: 10.1007/978-3-319-49944-4_1
Regulations, however, implicitly contain processes, which can be extracted from them by creating process models or executing text mining. Process models extracted from business and standard regulations provide a basis for discovering deviations between the actual business operation and the requirements articulated in regulations. Discovering deviations requires a structural investigation of these processes within a context-dependent environment. An ontology-based approach using text mining is one way to fulfil this requirement. Because the aim of creating ontologies is to specify a conceptualization of a given domain, and text mining can help to build ontologies in a semi-automatic manner, ontologies can reflect contexts and their concepts unambiguously. Process ontologies preserve business process elements from the models in a unified way. Ontology matching can enhance the structural and semantic investigation of process ontologies.
These techniques – creating process ontologies with text mining, running ontology matching, and interpreting its result – can underlie a tool which is capable of processing business regulations and standard regulations and of revealing discrepancies and similarities between them, focusing on the meaning of these documents. This feature is missing from computer assisted audit tools; hence, this application can be regarded as a solution to fill this gap.
This paper presents an application that integrates these techniques and provides a transparent report of the discovered knowledge for auditors. Several countries have adopted quality audits in their higher education systems. The process of quality audit in the education system focuses primarily on procedural issues rather than on the results or the efficiency of a quality system implementation. Internationalization is part of the quality culture of a higher education institution as well. Institutions are increasingly motivated to participate in an internationalization audit, which is why this domain was selected as the use case of this application.
The aim of this paper is to present a semantic application using text mining that assists institutions with valuable feedback about their international activities and improves the chance of their international accreditation. Section 2 presents the above-mentioned semantic techniques (process ontology, text mining in this field, ontology matching). The process along which this application works is shown in Sect. 3. The results of the implemented application are illustrated by one of the internationalization processes, Erasmus mobility, in Sect. 4.
2 Semantic Theoretical Background
All organizations represent their processes based on general and specific characteristics which are conditioned by locations, ranges, or antecedents. The boundary conditions and restrictions reflect environmental factors, such as regulatory elements or best practices, at the various levels of business process maturity. Business process management (BPM) provides support for managing the processes of organizations and facilitating their adaptation to a dynamically changing environment. BPM encompasses methods, techniques, and tools to design, enact, control, and analyze operational business processes involving humans, organizations, applications, documents, and other sources of information [2]. Modern BPM suites are evolving to automate the modeling, monitoring, and redesign of complex processes, although there are still many open issues to be addressed. A conceptual model captures the semantics of a process through the use of a formal notation, but the descriptions resulting from a conceptual model are intended to be used by humans. The semantics contained in these models are to a large extent implicit and cannot be processed automatically. With a web-based semantic schema such as the Web Ontology Language (OWL), the creation and the use of conceptual models can be improved; furthermore, the implicit semantics contained in the models can be partly articulated and used for processing [3]. Ontologically represented process models allow querying on a relatively high level of abstraction.
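To make this concrete, here is a minimal, purely illustrative sketch of such high-level querying over a toy triple set. A real system would query the OWL ontology with SPARQL or a reasoner; all element names below (Submit_application, produces_output, etc.) are hypothetical.

```python
# Toy triple store standing in for the OWL process ontology.
TRIPLES = {
    ("Submit_application", "rdf:type", "Process_step"),
    ("Application_form", "rdf:type", "Document"),
    ("Submit_application", "produces_output", "Application_form"),
}

def query(pattern, triples=TRIPLES):
    """Match an (s, p, o) pattern; None acts as a wildcard variable."""
    return [t for t in triples
            if all(v is None or v == tv for v, tv in zip(pattern, t))]

def steps_producing(document, triples=TRIPLES):
    """High-level question: which process steps produce the given document?"""
    return sorted(s for (s, _, _) in query((None, "produces_output", document), triples)
                  if (s, "rdf:type", "Process_step") in triples)
```

The point is that the question is posed at the level of process concepts (steps, documents, relations), not at the level of the document text.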
The usage of Semantic Web technologies such as reasoners, ontologies, and mediators promises business process management a completely new level of possibilities. This approach is known as semantic business process management (SBPM) [4]. When a new regulation is established, the business processes have to comply with it. In SBPM, the business processes as well as the new regulation are defined in a way that a machine is able to understand; therefore, no manual work is needed to verify that the business processes comply with the new regulation.
Process Ontologies
Process ontologies are created in order to describe the structure of a process, whereas organization-related ontologies provide a description of the artefacts or actors that are utilized or involved in the process. Domain ontologies provide additional information specific to an organization in a given domain.
Process ontologies have no precise definition in the academic literature. Some refer to a process ontology as a conceptual description framework of processes [5]. In this interpretation, process ontologies are abstract and general. In contrast, task ontologies determine a smaller subset of the process space: the sequence of activities in a given process. In our approach, the concept of process ontologies is used, where the ontology holds the structural information of processes with multi-dimensional meta-information, partly to ground the channeling of knowledge embedded in domain ontologies. We present an approach for representing business processes semantically, by translating them into a process ontology that captures the implicit and explicit semantics of the process model. We have also implemented a translation tool to convert a business process model into its OWL representation, serving as a basis for further analysis.
We elaborated a method for extracting a business process, in the shape of a process ontology, from documents using semantic text mining. Two process ontologies serve as a basis for detecting deviations in business processes. Ontology learning and matching techniques were integrated into our application.
Ontology Learning
The objective of ontology learning is to generate ontologies using natural language processing and machine learning techniques. Text mining techniques such as similarity measures, pattern recognition, etc. are used to extract terms and their relationships to build an ontology [6].
Methontology is one of the best-known methodologies for ontology construction, supplying a set of reference tasks necessary to build an ontology. It is a general, domain-independent methodology, which defines the main activities of the ontology construction process and specifies the steps for performing them [13]. Some approaches construct domain ontologies that reflect the domain covered by the input texts, rather than top-level, highly abstract ontologies or lexicalized ontologies (e.g., WordNet). Examples of such realizations are Text2Onto [14], OntoLearn (TermExtractor, WCL System) [10], and SPRAT [11].
The relevant techniques for automating the conceptualization are:
1. Building a glossary of terms (term extraction),
2. Building concept taxonomies,
3. Identifying ad-hoc relations.
Other techniques cover additional tasks as well, for example, describing rules [12].
Term extraction is usually supported by linguistic and statistical techniques used jointly. For building concept taxonomies, structural and contextual techniques are often used. For identifying ad-hoc relations, pattern-based techniques are often used. Relevant tools combine several techniques, resulting in a hybrid method.
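A minimal sketch of the joint linguistic-statistical idea behind term extraction follows; the stopword list and frequency threshold are arbitrary illustrative choices, not those of any cited tool.

```python
import re
from collections import Counter

# Toy stopword list standing in for the linguistic filtering step.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "by", "in", "it"}

def extract_terms(text, min_count=2):
    """Linguistic step: lowercase, tokenize, drop stopwords.
    Statistical step: keep tokens whose frequency reaches min_count.
    Real tools add POS tagging, n-gram detection, and domain-contrast measures."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [t for t, c in counts.most_common() if c >= min_count]
```

Even this naive hybrid already surfaces domain terms ("application") while suppressing function words.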
Ontology Matching
Alasoud et al. [7] define the ontology matching problem as identifying semantic correspondences between the components of the entities of ontologies.
Element-level matching techniques focus on matching entities and their instances without any information about their relationships with other entities and instances. Structure-level matching techniques scrutinize not only matching entities but also their relations with other entities and instances [8].
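The contrast between the two families can be sketched as follows; the name normalization and the neighbourhood test are deliberately naive stand-ins for real matching techniques.

```python
def element_level_match(classes_a, classes_b):
    """Element-level: match entities by (normalized) name alone,
    ignoring how they relate to other entities."""
    norm = lambda s: s.lower().replace("_", " ")
    by_name = {norm(b): b for b in classes_b}
    return {a: by_name[norm(a)] for a in classes_a if norm(a) in by_name}

def structure_level_match(edges_a, edges_b, pairs):
    """Structure-level: keep a name match only if every relation of the
    entity in ontology A has a counterpart between the matched entities
    in ontology B (a deliberately strict neighbourhood test)."""
    return {a: b for a, b in pairs.items()
            if all((b, pairs.get(t)) in edges_b
                   for (s, t) in edges_a if s == a)}
```

Structure-level matching thus prunes spurious name matches whose surrounding relations disagree between the two ontologies.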
Building our semantic audit application requires an ontology matching tool that fulfills the following criteria:
– It must be customizable, to adapt the changes of new or improved process models into the audit report.
– It must be integrable with other components, to ensure that the ontology building and matching procedures can work together.
– The technical report provided by the tool must be structured text, in order to process it automatically.
– It must handle different languages of process ontologies (RDF, XML, OWL).
The Ontology Alignment Evaluation Initiative contest inspires researchers to develop new ontology matching tools. LOGMap, YAM++, and Protégé 4 OWL Diff were investigated based on these criteria in [15]. Although Protégé 4 OWL Diff is capable of running only a structure-level investigation, it provides libraries to handle different ontology languages, open source code, and well-structured technical reports. This tool was used to develop this semantic audit application.
3 Semantic Audit Application
Auditors have to collect evidence that the operations of companies comply with the requirements articulated in guidelines, standards, policies, etc. This evidence resides in documents or data provided by information systems. As we have seen in Sect. 1, CAATTs are usually not capable of processing documents semantically. The main functional requirements of this semantic audit application are the following. It must be capable of:
– Processing organizational documents in an automated manner
– Focusing on the semantic content of these documents
– Interpreting this content with respect to the requirements extracted from reference documents
– Comparing the semantic content of organizational and reference documents
– Presenting the result of this comparison in an interpretable and transparent report.
The process of this semantic audit application is presented in Fig. 1.
The first phase is to create an ADONIS process model from the standard regulation (1) and transform it into the Reference Process Ontology (RPO) (3) using an XSLT transformation (2) [16].
ADONIS is a graph-structured business process management language. Its main feature is its method independence; our approach is in principle transferable to other semi-formal modeling languages. The semantic annotation for explicitly specifying the semantics of the tasks and decisions in the process flow is important in our method.
For conceptualization, several parameters have to be set or defined when modeling a business process. Vertically, we can specify operational areas only, or process areas, process models, subprocesses, activities, or go even deeper, to the algorithms. Horizontally, extra information can be modeled within the business process: organizational information can be specified in an organogram; the roles can be referenced in the RACI (Responsible, Accountable, Consulted, Informed) matrix of the process model; the input and output documents in the document model; and the applied IT system elements can be added to the IT system model as well.
In the second step (2), the conceptual process models are mapped to process ontology concepts. The transformation procedure follows a meta-modeling approach: links between model elements and ontology concepts have been established. The process ontology describes both the semantics of the modeling language constructs and the semantics of the model instances.
In order to map the conceptual models to ontology concepts, the process models are exported in the ADONIS XML format. The converter maps the ADONIS business process modeling elements to the appropriate ontology elements at the meta-level. The model transformation aims at preserving the semantics of the business model. The general rule we follow is to express each ADONIS model element as a class in the ontology and its corresponding attributes as attributes of that class. This conversion is performed with an XSLT script and results in the Reference Process Ontology (RPO).
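In our implementation this conversion is done by the XSLT script. Purely to illustrate the mapping rule (model element → ontology class, attributes → class attributes), the following Python sketch applies it to a hypothetical, heavily simplified ADONIS-like export; the real export schema is richer and differs in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ADONIS-style export used only for illustration.
ADONIS_XML = """
<MODEL name="Student Application">
  <INSTANCE class="Activity" name="Submit application">
    <ATTRIBUTE name="Responsible role">Student</ATTRIBUTE>
  </INSTANCE>
  <INSTANCE class="Document" name="Application form"/>
</MODEL>
"""

def model_to_ontology(xml_text):
    """Apply the mapping rule: each model element becomes an ontology class
    (named after the element); its attributes become attributes of the class."""
    root = ET.fromstring(xml_text)
    ontology = {}
    for inst in root.findall("INSTANCE"):
        cls_name = inst.get("name").replace(" ", "_")
        ontology[cls_name] = {
            "superclass": inst.get("class"),
            "attributes": {a.get("name"): a.text for a in inst.findall("ATTRIBUTE")},
        }
    return ontology
```

The XSLT script realizes the same element-to-class and attribute-to-attribute mapping declaratively over the full ADONIS schema.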
To represent the business model in the ontology, the representation of the ADONIS modeling language constructs and the representation of the ADONIS model elements have to be differentiated. The ADONIS modeling language constructs are created as classes and properties, and the ADONIS model elements are represented through the instantiation of these classes and properties in the ontology. The linkage between the ontology and the ADONIS model element instances is accomplished by the usage of properties.
The second phase is to build the Organizational Process Ontology (OPO) from organizational documents, with the help of the RPO, within the process ontology building component (4).

Fig. 1. Reference process ontology

The first step of its algorithm is to identify process elements of the RPO (like Student or Coordinator as a specific Role) – excluding process steps – or discover new ones within a given document, and add them to the OPO as subclasses of the appropriate superclasses such as Role, Document, etc. New process elements are discovered with the help of semantic text mining. The algorithm focuses on finding patterns shaped as open queries. Relations are regarded as ordered pairs. The algorithm assumes that certain expressions can represent a given relation within the document; e.g., the produces_output(Process_step, Document) relation suggests that something must happen to a document, e.g., it is submitted or signed. That is why the algorithm looks for the pattern "x submit y" within the document, where y is a document. It seeks the term "submit" and collects a few words after it. It adds this expression as a subclass of the Document class to the OPO.
The second step in building the OPO is to identify the process steps of the RPO within the document and connect them to the nearest process elements already existing in the OPO. The algorithm seeks every term of a given process step within each sentence of the document and counts the hits. If the number of hits is greater than a given threshold, the identified process step is added to the OPO as a subclass of the Process step class, and it is connected to other process elements identified nearby (namely within a given radius of words) in the text.
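The two steps just described can be sketched as follows; the verb, window size, and threshold are illustrative defaults rather than the application's actual parameters.

```python
import re

def find_pattern_objects(text, verb="submit", window=3):
    """Step 1: seek the verb standing for a relation (e.g. 'submit' for
    produces_output) and collect a few words after it as a candidate
    process element (here: a Document subclass for the OPO)."""
    candidates = []
    for m in re.finditer(r"\b{}s?\b".format(verb), text.lower()):
        tail = text[m.end():].split()[:window]
        candidates.append(" ".join(tail).rstrip(".,"))
    return candidates

def step_identified(step_terms, text, threshold=2):
    """Step 2: count how many terms of a reference process step occur in
    each sentence; the step is identified if some sentence reaches the
    threshold (and would then be linked to nearby process elements)."""
    sentences = re.split(r"[.!?]", text.lower())
    return max(sum(t in s for t in step_terms) for s in sentences) >= threshold
```

In the real application, the candidate phrase is then added as a subclass of Document, and an identified step becomes a subclass of Process step connected to the elements found within the word radius.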
Having created the OPO (5), its full version, or a tailored version resulting from a DL query, is compared to the appropriate version of the RPO within the ontology matching component. Its technical report is processed by a report generator to create a transparent report for auditors, which contains information about the number of tasks, the filtered roles, and the missing, unnecessary, or common organizational process elements. Hence, auditors can discover areas requiring deeper investigation in the next phase, when the leaders are interviewed (Fig. 2).
Fig. 2. Processes of the semantic audit application
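The classification performed by the report generator can be sketched as a set comparison over the class names of the two ontologies; the field names below are illustrative, not the actual report format.

```python
def deviation_report(rpo_classes, opo_classes):
    """Classify process elements as missing (required by the reference but
    absent from the organization), unnecessary (present in the organization
    only), or common, as the auditors' report does."""
    rpo, opo = set(rpo_classes), set(opo_classes)
    return {
        "missing": sorted(rpo - opo),
        "unnecessary": sorted(opo - rpo),
        "common": sorted(rpo & opo),
        "reference_elements": len(rpo),
    }
```

Applied to the Erasmus use case below, such a comparison would flag, for example, that the organizational document names the University where the reference expects the Coordinator.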
I. Szabó and K. Ternai
The application is a Java application that uses the OWL API, DLQueryExample, and
the SVN repository of Protégé 4 OWL Diff as libraries. The following
section presents how the system works in the field of internationalization of higher
education institutions.
4 Its Implementation in the Erasmus Mobility Field
An example run of the application is shown in this section. It uses the Student
Application process from the Erasmus Mobility Handbook as the standard process. Erasmus
mobility calls represent the organizational documents. The following questions were
investigated during an audit procedure conducted at several Hungarian higher
education institutions.
How effective are the current mechanisms? What kind of communication channels
exist between the various levels managing internationalization activities? How effective
are they? What are the missing functions of the internationalization units? Which units
are less efficient? Why? [9].
In the modeling phase, reference process models were formalized from the
Erasmus+ Programme Guide1, and the process models were implemented using the
BOC ADONIS modeling platform2. The reference business process model, detailed
with the above-mentioned parameters, can be seen in Fig. 3.
The Reference Process Ontology transformed from this model and an Erasmus call
of a Hungarian higher education institution were used to demonstrate the applicability of this
solution with respect to the following audit questions, which aim at investigating
the effectiveness of the current mechanisms.
Audit Question 1: Is the same role responsible for performing this process?
The answer requires filtering the ontology by Roles. The report created from the
technical report provided by the Protégé 4 OWL Diff ontology-matching component
shows that the University, instead of the Coordinator, was mentioned as the role in the
organizational document. The Erasmus mobility call states that
“Qualified applicants will be invited to take the entrance examination organized by the
university”. This leaves open the question of who is the person responsible for organizing
this entrance examination (Fig. 4).
However, Student was mentioned on both sides, hence we can investigate the next
audit question.
Audit Question 2: To what extent are the tasks performed by the same role?
Semantic Audit Application for Analyzing Business Processes
Fig. 3. The reference process model. Application of Erasmus Grant – Description: Application of Erasmus Grant; RACI: Student (Erasmus Office); Documents: Application data sheet; Systems: Online application system. Grant Allocation – Description: Financial Managership; RACI: Coordinator (Erasmus Office); Documents: currency requirement blank.
To answer it, the ontologies were filtered by the ‘performed_by only (Student)’ DL
Query. The result is presented in Fig. 5.
This report shows that signing the support contract is not a task of a student, or it is
not mentioned in the organizational document. We found that the latter event occurred.
Fig. 4. Report of the role investigation
Fig. 5. Report about not mentioned tasks
These reports revealed that a role and a task were missing in this Erasmus mobility
call, so this Student Application process does not comply with the requirements of the
Erasmus Mobility Handbook. The process is not effective: because students do not
know about their responsibility for signing the contract, they will only be informed in a
later phase of the process, which makes the process slower. The auditor has to
investigate whether the source of this problem is a document that does not reflect the
process well, or the process itself.
5 Conclusion and Future Work
Nowadays, Campus Mundi projects aim to improve higher education processes in
Hungary. The audit guideline elaborated for the compliance checking of
internationalization activities aims to detect “how effective the current mechanisms are”.
Our semantic audit application can help to compare institutional processes with the standard
processes articulated in the Erasmus Mobility Handbook. The Student Application process was used to test this application. Erasmus mobility calls represent the organizational
documents. The test was executed on ten different sources.3 The first chart shows that
the algorithm identified at least one role within each organizational document. These
roles were mostly interpretable (like Student, Coordinator, and University), except
in the 6th case. We can state that the “by the” semantic rule responsible for
identifying roles is applicable, because its false discovery rate is low.
The second chart shows a notable difference between the tasks performed by students within the institutions and those prescribed by the handbook. This implies either that
our algorithm does not identify these tasks well, so we have to improve it, or that different
institutions oblige students to perform tasks that are not mentioned in the
handbook; perhaps these tasks belong to another role. This is the problem of segregation of duties, which must be investigated by the auditors (Fig. 6).
This application can be used to process business regulations semantically instead of
manually, which saves time for auditors. It provides a report that shows deviations in
business management, if they exist. Auditors can use this knowledge to seek information
with more focus or to ask managers relevant questions during the next phase of the audit.
More metrics can be used to test the precision of the algorithm of this application.
The organizational process ontology (OPO) stores several text parts used to identify the
above-mentioned process elements. These text parts can be used to calculate hit rates
such as false or true positive/negative rates. Based on this information, we can improve
the algorithm.
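Such hit rates follow directly from the counts of correctly and incorrectly identified elements. A minimal sketch (the counts below are invented for illustration):

```python
def hit_rates(tp, fp, fn, tn):
    """Standard rates computed from true/false positive/negative counts
    of identified process elements. Counts here are illustrative."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    false_discovery_rate = fp / (tp + fp)   # the rate reported as low above
    specificity = tn / (tn + fp)
    return precision, recall, false_discovery_rate, specificity

p, r, fdr, spec = hit_rates(tp=9, fp=1, fn=2, tn=8)
print(p, fdr)
```

A low false discovery rate, as observed for the “by the” rule, corresponds to a high precision of the extracted roles.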
The complexity of the process models and the granularity of the organizational documents
influence the scalability of the system. The ten above-mentioned materials were processed within a similar time period, but they were not very large documents. Testing the
scalability of the system is future work.
Fig. 6. Results of this application
Acknowledgements. The authors wish to express their gratitude to Dr. András Gábor, associate
professor at the Corvinus University of Budapest, for the great topic and the valuable help
provided during the development process.
“This work was conducted using the Protégé resource, which is supported by grant
GM10331601 from the National Institute of General Medical Sciences of the United States
National Institutes of Health.”
References
1. Braun, R.L., Davis, H.E.: Computer-assisted audit tools and techniques: analysis and
perspectives. Manag. Auditing J. 18(9), 725–731 (2003)
2. van der Aalst, W.M.P., ter Hofstede, A.H.M., Kiepuszewski, B., Barros, A.P.: Workflow
Patterns. Distrib. Parallel Databases 14(1), 5–51 (2003)
3. Hepp, M., Cardoso, J., Lytras, M.D.: The Semantic Web: Real-World Applications from
Industry. Springer, New York (2007). ISBN: 0387485309
4. Koschmider, A., Oberweis, A.: Ontology based business process description. In: Proceedings of the CAiSE, pp. 321–333 (2005)
5. Hepp, M., Roman, D.: An ontology framework for semantic business process management.
In: Proceedings of Wirtschaftsinformatik, Karlsruhe, 28 February–2 March 2007 (2007)
6. Maedche, A., et al.: Ontology learning part one — on discovering taxonomic relations from
the web. In: Zhong, N., et al. (eds.) Web Intelligence, pp. 301–319. Springer, Heidelberg
7. Alasoud, A., Haarslev, V., Shiri, N.: An effective ontology matching technique. In: An, A.,
Matwin, S., Raś, Z.W., Ślęzak, D. (eds.) ISMIS 2008. LNCS (LNAI), vol. 4994,
pp. 585–590. Springer, Heidelberg (2008). doi:10.1007/978-3-540-68123-6_63
8. Otero-Cerdeira, L., et al.: Ontology matching: a literature review. Expert Syst. Appl. 42(2),
949–971 (2015)
9. Temesi, J.: Nemzetköziesítés a magyar felsőoktatási intézményekben, Audit materials (2013)
10. Velardi, P., Cucchiarelli, A., Petit, M.: A taxonomy learning method and its application to
characterize a scientific web community. IEEE Trans. Knowl. Data Eng. 19(2), 180–191 (2007)
11. Maynard, D., Funk, A., Peters, W.: SPRAT: a tool for automatic semantic pattern-based
ontology population. In: Proceedings of the International Conference for Digital Libraries
and the Semantic Web (2009)
12. Buitelaar, P., Cimiano, P., Magnini, B.: Ontology learning from text: an overview. In:
Ontology Learning from Text: Methods, Applications and Evaluation, pp. 3–12. IOS Press
13. Corcho, O., Fernández-López, M., Gómez-Pérez, A., López-Cima, A.: Building legal
ontologies with METHONTOLOGY and WebODE. In: Benjamins, V.R., Casanovas, P.,
Breuker, J., Gangemi, A. (eds.) LNCS (LNAI), vol. 3369, pp. 142–157.
Springer, Heidelberg (2005). doi:10.1007/978-3-540-32253-5_9
14. Cimiano, P., Völker, J.: Text2Onto–a framework for ontology learning and data-driven
change discovery. In: Montoyo, A., Muñoz, R., Métais, E. (eds.) NLDB 2005. LNCS, vol.
3513, pp. 227–238. Springer, Heidelberg (2005). doi:10.1007/11428817_21
15. Szabó, I.: Future development: towards semantic compliance checking. In: Gábor, A.,
Kő, A. (eds.) Corporate Knowledge Discovery and Organizational Learning. KMOL, vol. 2,
pp. 155–173. Springer, Heidelberg (2016). doi:10.1007/978-3-319-28917-5_7
16. Ternai, K.: Semi-automatic methodology for compliance checking on business processes. In:
Kő, A., Francesconi, E. (eds.) EGOVIS 2015. LNCS, vol. 9265, pp. 243–256. Springer,
Heidelberg (2015). doi:10.1007/978-3-319-22389-6_18
Using Application Ontologies
for the Automatic Generation of User
Interfaces for Dialog-Based Applications
Michael Hitz(&) and Thomas Kessel
Cooperative State University Baden-Wuerttemberg, Stuttgart, Germany
Abstract. The paper presents a data-centric, model-driven approach for the
automatic generation of user interfaces (UIs) for dialog-based applications using
ontological descriptions. It focuses on Interview Applications, a common pattern
e.g. for self-service applications in EIS. Existing approaches for automatic UI
generation usually rely on proprietary, UI-specific description models, designed
and developed manually for the application in focus. The manual creation of the
artefacts leads to a gap in the automated development, especially for dialog-based
application UIs, where the structure and behavior are driven by the processed data.
Furthermore, the UI-specific nature of the artefacts impedes their (re-)use in
different contexts. The presented approach is a shift away from a UI-specific
towards a data-centric method of modelling dialog-based applications, bridging
this gap. Application ontologies are used as description means, which leads to
reusable, sharable model artefacts, applicable to different contexts of use.
1 Introduction
The ongoing digitalization of business processes in Enterprise Information Systems
(EIS) has raised the need to expose different variants of user interfaces (UIs) to let
different user groups interact with the systems in different contexts of use and on
different platforms (e.g. as desktop, mobile, and web applications for customers
or insurance brokers).
A commonly observed pattern in sales-related dialog-based applications is to collect
the data needed for the execution of a business process in a directed dialog in the form of an
interview (a.k.a. form filling or directed dialog [4]). Examples of these Interview
Applications can be found e.g. on the internet or in web-based business portals: the
booking of a flight, a money transfer in a banking portal, or the request of a quote for
an insurance product. This application type is characterized by a high degree of standardization and clearly defined interaction concepts (mostly motivated by company
style guides or platform standards). Thus, UIs for this application type are well
suited for automatic generation.
Although quite a lot of approaches exist for the model-driven generation of
UIs (see Related Work), they are not widely used in practice [14]. Mostly they rely on
manually modelling UI-specific artefacts (e.g. concrete UI descriptions, taskflow and
related data models), which are closely coupled and thus complex to create and
maintain. In addition, the UI-specific nature and proprietary descriptions impede their
reuse in different contexts [6]. Furthermore, the manual development of the artefacts
leads to a clear break and a gap in automated application development – especially
for data-driven applications such as the aforementioned Interview Applications, where the
structure and behavior of the UI are closely related to the data.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 16–31, 2016.
DOI: 10.1007/978-3-319-49944-4_2
The objectives of this paper are (1) to bridge this gap by pushing the description
from proprietary UI-specific information towards a data-centric model for
applications and (2) to use common, non-proprietary descriptions (i.e. RDF/OWL
ontologies) as modelling means.
The approach is based on the thesis that the structure and behavior of UIs for Interview
Applications rely on the characteristics and semantics of the processed data. Thus, UIs
can be derived from data descriptions that contain the relevant semantic information.
The proposed solution uses a single, declarative model of the processed data, which is
augmented by additional semantic information used to infer all the information
required to derive UIs.
The major benefits of this approach are (1) a single, UI-agnostic artefact
related to the processed data, which allows improved automation of application variant
development, and (2) a non-proprietary, shareable application description, applicable to different contexts of use.
The approach proposed in this paper contributes to the field of model-driven
development of user interfaces for dialog-based applications. It adds concepts for the
use of semantic knowledge about the processed data, its representation as an ontology, and
its use to derive UIs.
The paper is organized as follows: first, the basic idea of a data-centric description,
the character of Interview Applications, and the information required to automatically
generate user interfaces are elaborated. Then the representation of this information by
means of ontologies and its concepts is presented, followed by an outline of the
derivation process for the UIs. Finally, the evaluation of the concept and related work are
presented, followed by a conclusion.
2 Data-Centric Description of Interview Applications
The basic assumption of the data-centric approach is that manually developed UIs for
Interview Applications are built on the characteristics and semantics of the data
processed by the application. Application developers use this knowledge to build
suitable frontends for the data – e.g. a reasonable selection, grouping, and sequence of
input elements, the showing/hiding of sections, or the navigation between pages [6].
This knowledge is used implicitly by the developer and is based on his experience or other
rules that are tacit knowledge. The basic idea is to incorporate this semantic
knowledge into a data-centric model along with the processed application data.
The following sections illustrate the character of Interview Applications and show
which information is needed to derive a UI.
2.1 Character of Interview Applications
Interview Applications collect related data in a meaningful sequential flow of questions
in a dialog with the user. Depending on the information already entered, the flow might
change and, if applicable, further questions are asked or omitted. The
following example illustrates the data-driven character of Interview Applications. It
contains most of the characteristics that need to be modeled in a data-centric description,
listed in the next section.
Example: quote for a liability insurance. The computation of a quote for a liability
insurance is chosen as a sample use case. Such calculators are a very common application type in the insurance domain and incorporate multiple interaction patterns
common to Interview Applications.
Fig. 1. UI for calculating a quote: (a) customer data, (b) product configuration
Figure 1 shows an example of a graphical UI as used by an agent. The agent successively
enters the data needed for computing a quote. In Fig. 1a, subsequent questions are asked concerning the customer (e.g. name and marital status). On the left side
there is a hierarchical navigation that allows random switching between various
question groups (e.g. customer and contract data).
The data input elements are chosen based on type-related properties (e.g. the basic
type, value ranges, restrictions, etc.). They are ordered in a semantically meaningful
manner (e.g. name information before marital status) and have hierarchical relations
to each other (e.g. in the left-hand navigation of Fig. 1a, contact information is shown
as part of customer data).
The processed data elements are semantically interrelated. This reaches from related
content (e.g. the zip code is related to a certain city) to existential relations, e.g. the
date of marriage and partner data only exist if the marital status is set to married
(Fig. 1a, ❶❷). Additionally, dialogs show dynamic behavior: input is validated, and
field content might need adjustment as a reaction to changes in other fields (e.g.
prefilling the city according to a given zip code) or to explicitly initiated user actions (e.g.
opening a customer database to prefill customer data, Fig. 1a, ❸).
2.2 Information Needs for Automatic Generation of UIs
In previous work (cf. [8, 9]) we derived a set of interaction patterns and extracted the
information needed to build UIs that behave as in the example above. This was
done by analyzing existing ‘real-life’ applications used at a major insurance company
and matching the findings to existing work within the field of UI generation (e.g. [4, 20]).
The analysis leads to a set of information needed to automatically derive UIs using
these patterns. The information can be grouped into two categories (cf. Table 1).
Table 1. Information needs for a datamodel and its use to derive UIs.
Type-Related and Structural Information (I1–I4): This information is needed to
describe the data elements (i.e. types and type restrictions like ranges or allowed
values), their structure (i.e. grouping and hierarchical correlation), and a meaningful
temporal succession of the questions within the interview [5], which is based on
semantic cohesion.
Behavioral Information (I5–I7): This is needed to model the dynamic, data-related
aspects of the UI for evaluation at runtime, i.e. conditions on the
existence/activation of elements/groups bound to the content of other data elements
within the model, the indication of complex validation, and operations associated with
data elements and groups, triggered on changes of the input data (reactions) or triggered
by the user (actions) [15, 20].
This set of information was found adequate to derive the different aspects of the UI
following the interaction patterns in focus. Table 1 summarizes the usage of the
information within a derivation process, which is detailed in Sect. 4.
Based on these findings, a meta model was developed that incorporates the identified
information and serves as a foundation for developing data descriptions for Interview
Applications. Figure 2 shows this meta model as a UML diagram.
Table 2. Facets
Fig. 2. Metamodel in UML notation
A data description (DataDescription) consists of a succession of data groups
(DataGroup) that might contain an ordered list of further groups or data elements
(DataItem). This constellation allows modelling of the requested structural information
regarding cohesion, (hierarchical) grouping, and temporal succession (order) of the
elements (I2, I3, I4). Groups and data items are detailed by attributes/facets, e.g. the
type information (I1) and existential and activation conditions (I5) can be specified
for each description element in the model. Further facets are used to specify the element
more precisely in terms of data-related aspects, i.e. type restrictions that are usually part
of a type system like XML Schema (I1). Table 2 summarizes the semantics of the
facets for DataGroups and DataItems. Additionally, each description element might
have associated validation, reaction, and action operations (I6, I7). These are
detailed by further facets (cf. Fig. 2) like a name for the operation, triggering events,
and references to model elements needed for the execution of the operation
(input/output parameters) [8]. Elements are referenced by an identifier in dotted
notation, indicating their position in the model hierarchy so that they can be retrieved at runtime
(cf. Sect. 4, step 3).
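A minimal sketch of this meta model and the dotted-notation lookup is given below. It is a simplification of Fig. 2 under stated assumptions: only a few facets are shown, and the class and field names are illustrative rather than the paper's exact model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataItem:
    name: str
    datatype: str = "string"          # type facet (I1), simplified
    condition: Optional[str] = None   # existential/activation condition (I5)

@dataclass
class DataGroup:
    name: str
    children: List[object] = field(default_factory=list)  # ordered: I3/I4

def resolve(root, dotted_path):
    """Resolve an element by its dotted identifier, e.g. 'customer.fullname',
    by walking down the group hierarchy level by level."""
    node = root
    for part in dotted_path.split("."):
        node = next(c for c in node.children if c.name == part)
    return node

customer = DataGroup("customer", [DataGroup("fullname", [DataItem("lastname")])])
root = DataGroup("liability", [customer])
print(resolve(root, "customer.fullname.lastname").name)
```

The `resolve` helper mirrors how dotted identifiers could locate referenced elements at runtime, e.g. for operation input/output parameters.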
The resulting model addresses the first objective of this paper stated in Sect. 1: an
approach describing Interview Applications based on their processed data, augmented
by data-related information that can be used to derive UIs. To achieve the second goal –
using sharable and common means for the description – we apply this model to
RDF/OWL as a common language in the field of semantic web technologies.
3 Using Ontologies as Application Description
The objective is to describe the mapping of the information requirements (I1–I7) listed
in Sect. 2.2 to RDF/OWL and hence develop an application ontology.
RDF/OWL [11] and its basic features are selected as a well-understood, widely adopted
technology used in different contexts, for which tooling is available (e.g. reasoners,
APIs). Table 3 summarizes the RDF/OWL features and concepts that are used to model
the identified information requirements (I1–I7).
Table 3. Mapping of information needs to RDF/OWL features
Ontologies in general are intended to describe entities, relationships, contained data
elements, and additional facts in a way that allows inferences to be built upon that knowledge.
Hence, the mapping of most of the structural information identified in Sect. 2.2 to
RDF/OWL is a straightforward task.
To illustrate the mapping, Listing 1 shows a simplified application ontology for the
customer data example in Sect. 2.1¹: DataGroups are modeled as owl:Classes within
an ontology, and their hierarchical relations as owl:ObjectProperties (e.g. customerdata
as an object property of a Liability with range Customerdata). The Classes section
declares the DataGroups (e.g. Customerdata, Contractdata, and Fullname) as part of
the application ontology (i.e. an ontology for a liability insurance <http://…/
liability/v1#>). DataItems are likewise defined as owl:DatatypeProperties, containing
information about which class they belong to, along with basic type information (e.g.
Listing 1, exemplary data associated with an Address). Using these basic concepts, the
structural information of I2 and I4, and partially of I1, is covered.
¹ Due to the space restrictions of the paper, a complete example can be reviewed at https://doi.org/10.
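The spirit of this mapping can be sketched by a small serializer that emits Turtle triples for groups and items. The namespace, names, and output shape below are illustrative assumptions, not the paper's actual Listing 1:

```python
def to_turtle(ns, groups, items):
    """Sketch of the mapping: each DataGroup becomes an owl:Class,
    each DataItem an owl:DatatypeProperty with a domain class and an
    XSD range. Illustrative only; not the paper's listing."""
    lines = [
        f"@prefix : <{ns}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
    ]
    for g in groups:
        lines.append(f":{g} a owl:Class .")
    for name, domain, rng in items:
        lines.append(f":{name} a owl:DatatypeProperty ; "
                     f"rdfs:domain :{domain} ; rdfs:range xsd:{rng} .")
    return "\n".join(lines)

print(to_turtle("http://example.org/liability/v1#",
                ["Customerdata", "Contractdata"],
                [("lastname", "Fullname", "string")]))
```

The generated fragment shows how the structural information (I2, I4, and part of I1) lands in plain RDF/OWL constructs.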
However, not all of the identified information needed for UI generation can be
expressed out of the box. Ontologies are made for knowledge representation, and
therefore RDF/OWL does not contain information like the sequence of data (I3), existential
conditions (I5), or functional aspects (I6, I7) in its basic language. To the best of our
knowledge, RDF/OWL includes neither a concept for the description of operations
nor one for declaratively modelling conditions/references on instance data. To express the
information needed, we use the OWL annotation concept as used by [7, 12] to produce
a profiled ontology. This allows incorporating the information declaratively and leads
to an application ontology that (1) is still covered by basic RDF/OWL (and thus can be
used for standard reasoning) yet (2) exposes the additional information to reasoners
(e.g. UI generators) that understand the profile.
Table 4 lists the annotations used within the proposed profile, along with their
mapping to the information needs. As an example, Listing 1 shows annotations for
type, sequence, validation, and reactions applied to elements of the sample ontology.
The proposed mapping onto RDF/OWL constructs addresses the second objective
of this paper stated in Sect. 1: it leads to an ontological description for Interview
Applications. Since it incorporates all the information contained in the meta model of
Sect. 2.2, UIs may be derived based on such an ontology (cf. Sects. 4 and 5). Since it
uses a common language and describes the processed data of an application, it can be
used in different contexts and is not limited to UI generation. An example of a non-UI
use will be given in the evaluation section.
Nevertheless, the approach has limitations regarding its universality. The consequence of a profiled ontology using proprietary annotations is that there has to be a reasoner that is aware of the profile. The contained information is not interpretable by
general reasoners and thus is not shared as ‘world knowledge’. The proposed solution
is consciously limited to hierarchical ontologies. This is not a restriction for Interview
Applications, as they operate on hierarchical data structures by definition. But this
characteristic prevents the approach from being applied to arbitrary ontologies that might
have a reticular graph structure. Sahar et al. [21] address this problem in the context of
UI generation. The results found there may be used to extend the applicability of the
proposed approach in future work.
Table 4. Elements of the annotation profile
4 Derivation Process for UIs
As outlined in Table 1 (Sect. 2.2), the information contained in the proposed model is
used for the automatic derivation of UIs. Figure 3 outlines the derivation process. The
basic approach is based on the concepts of the CAMELEON framework as proposed by
Calvary et al. [3]. The starting point for the UI derivation process is an instance of the
data-centric application model (data-centric core model). It contains the description of
the processed data of the application according to the structure and properties presented
in Sect. 2.2.
Step 1: The core model is transformed into an abstract UI (AUI) using information
about the context of use to concretize the information contained in the data-centric
model. This step is crucial to generate usable UIs from a solely data-centric model
that intentionally omits technical details. This phase includes the enrichment with
labels, explanatory texts, and help information (depending on the language context),
the mapping of data types to concrete types of the AUI (e.g. the mapping of the
custom type zip to a text field restricted to 5 digits if the language context is
German), and to abstract UI input elements – for instance, a number-range control for a
numerical value with min/max restrictions or a oneOfManySelection control for
elements restricted to a set of possible values. The information needed here is
derived from I1, I2, I3 and I4.
Step 2: Derives a concrete UI (CUI) from the AUI description by incorporating the
device context for which the UI is intended. This includes the mapping of fields onto
pages (pagination) using information about device restrictions (e.g. for mobile
devices) and exploiting the cohesion information contained in the data-centric
model. The latter indicates how a flow of questions may be split up and positioned
on pages for different device categories. The information needed here is derived
from I2 and I4.
Fig. 3. Derivation process for UI variants
Step 3: Depending on the technological context, a final UI is derived by generating
concrete UI widgets for the AUI controls. In addition, an access mechanism to user-entered data at runtime needs to be supplied, allowing the implementation of the
functional aspects, e.g. a model for the evaluation of visibility in a
Model-View-Controller application. The information for the functional aspects is
derived from I1, I5, I6 and I7.
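The facet-to-control mapping of step 1 can be sketched as a simple rule set. The control names follow the examples above (number-range, oneOfManySelection); the facet keys and the fallback rules are otherwise our assumptions:

```python
def abstract_control(item):
    """Map a data item's type facets to an abstract UI control (step 1).
    NOTE: illustrative sketch; facet keys and fallbacks are assumptions."""
    if "allowed_values" in item:
        return "oneOfManySelection"   # restricted to a set of values
    if item.get("type") == "number" and "min" in item and "max" in item:
        return "numberRange"          # numerical value with min/max
    if item.get("type") == "boolean":
        return "checkbox"
    return "textInput"                # default fallback

print(abstract_control({"type": "number", "min": 0, "max": 120}))
print(abstract_control({"type": "string",
                        "allowed_values": ["single", "married"]}))
```

In the actual derivation process, such rules would additionally consult the context of use (language, device) before fixing the concrete widget in steps 2 and 3.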
5 Demonstration and Evaluation
The following sections focus on the validation of the stated objectives, showing that
(1) a data-centric approach may lead to increased automation and is suitable for
generating UIs for Interview Applications, (2) ontologies can be used within the
approach to describe application UIs in a non-proprietary way, and (3) these descriptions are shareable and
thus applicable to different contexts of use.
Our research on the topic of data-centric application descriptions is conducted using
the Design Science Research (DSR) [18] and Action Design [22] approaches. The
resulting artefacts (i.e. the data-centric meta model, the derivation process, and the ontological
description) are refined in several iterations and evaluated per iteration by implementation and technical experiments with prototypes – a commonly applied
technique for the evaluation of algorithms and models [17].
The evaluation of the approach was conducted in association with a major German
insurance company (Allianz Deutschland AG), from which we drew the data for the
evaluation. The insurance company provided a set of typical ‘real-life’ Interview
Applications that were used for the analysis phase and the validation of the implementation. From this set, relevant applications were selected that cover the interaction
patterns identified during the analysis and demonstrate the usefulness of the automated
process and, afterwards, of the ontology developed in this paper.
To allow a deeper investigation, an online link2 is provided that lists sample
resources for the liability quote application used throughout this paper. It shows a
working example of application variants and the complete application models mentioned below.
Evaluation of the Viability of the Data-Centric Approach (1). First, the derivation
process outlined in Sect. 4 was implemented, resulting in a Transformation Service
(exposed as a RESTful web service) which transforms a data-centric application
description into a final UI for different platforms. The implementation was based on
available components from previous work by Hitz [8]. The implementation
focused on web-based Interview Applications, covering HTML/JavaScript UIs for
different device categories (mobile, desktop). In addition, another prototype for rich-client UIs using JavaFX was established recently.
Fig. 4. Basic setting for evaluation
2 Link to website with additional content: https://doi.org/10.13140/RG.2.2.16564.24963.
Figure 4 shows the basic setting for the implementation. As a first step, the
selected applications were described using a DSL (domain-specific
language) as proposed in [8] (Fig. 4, upper left), which contains a model following the
proposed meta model (cf. Sect. 2.2). These models were imported into the data-centric
core model and used by the transformation service to produce final UIs for different
platforms, as outlined in Sect. 4.
Results: The implementation of the transformation process and its application to existing Interview Applications showed that the information outlined in Sect. 2.2 is sufficient to derive non-trivial UIs for the identified interaction patterns in practice. The functionality of the generated UIs corresponds to a large extent to the manually designed counterparts which served as the basis for the analysis. It could be demonstrated that a single artefact is sufficient to model Interview Applications and that the data-centric approach leads to a high degree of automation when generating different variants of the applications (e.g. different technologies, navigation, and input styles).
However, limitations were observed in situations where the outlined model did not contain enough semantic information to select sophisticated widgets for the final UI. For example, whether to use a selection panel with buttons instead of a dropdown box depends on the character of the question (such as 'product component selection'). This issue was solved by extending the model with additional properties, such as semantic tags for an element.
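To make the mechanism tangible, the following heavily simplified sketch shows the core idea: a single data-centric model, including semantic tags, from which an HTML UI variant is derived. The field names, tag names, and rendering rules are hypothetical illustrations and are not the DSL or Transformation Service of [8].

```python
# Hypothetical sketch of a data-centric application model: each question is
# described by its data plus semantic hints; the UI is derived from this
# single artefact instead of being modelled by hand.
liability_quote = [
    {"name": "birth_date", "type": "date", "label": "Date of birth"},
    {"name": "coverage", "type": "enum", "label": "Coverage",
     "options": ["basic", "comfort", "premium"],
     "tags": ["product-component-selection"]},  # semantic tag drives widget choice
]

def derive_html(model):
    """Derive an HTML form fragment from the data-centric model."""
    parts = []
    for field in model:
        if field["type"] == "date":
            parts.append(f'<input type="date" name="{field["name"]}"/>')
        elif field["type"] == "enum":
            # The semantic tag selects a button panel instead of a dropdown.
            if "product-component-selection" in field.get("tags", []):
                parts.append("".join(
                    f'<button name="{field["name"]}" value="{o}">{o}</button>'
                    for o in field["options"]))
            else:
                opts = "".join(f"<option>{o}</option>" for o in field["options"])
                parts.append(f'<select name="{field["name"]}">{opts}</select>')
    return "\n".join(parts)

print(derive_html(liability_quote))
```

A second transformation function targeting another platform (e.g. JavaFX markup) could consume the same model unchanged, which is the point of the single-artefact approach.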
The results of this implementation are already used in production environments of
Allianz Deutschland AG, e.g. to dynamically generate the UIs of complex electronic
risk acceptance check applications for different products on customer and agent portals.
Applicability of Ontologies (2). To evaluate the applicability of the approach to ontologies, a comparative evaluation was chosen, based on the implementation of the first step (Fig. 4). The goal was to demonstrate that the proposed ontology has the same expressive power as the DSL used in the first step and thus produces the same output. To achieve this, the same applications were modeled using the proposed ontology (cf. Sect. 3) and an import module was implemented that maps the ontology contents to the core model of the transformation service (Fig. 4, lower left). This was used to generate final UIs, which were compared with the ones generated in the first step.
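Such an import step can be sketched as below. The triples and the vocabulary (`ui:hasLabel`, `ui:hasDataType`) are invented for illustration and do not reproduce the ontology proposed in Sect. 3; the sketch only shows the grouping of ontology statements back into core-model elements.

```python
# Hypothetical ontology content as subject-predicate-object triples.
triples = [
    ("app:birthDate", "rdf:type", "ui:Question"),
    ("app:birthDate", "ui:hasLabel", "Date of birth"),
    ("app:birthDate", "ui:hasDataType", "xsd:date"),
]

def import_to_core_model(triples):
    """Group statements by subject to rebuild core-model elements."""
    elements = {}
    for s, p, o in triples:
        elements.setdefault(s, {})[p] = o
    # Keep only subjects typed as questions of the (assumed) UI vocabulary.
    return {s: props for s, props in elements.items()
            if props.get("rdf:type") == "ui:Question"}

core = import_to_core_model(triples)
print(core["app:birthDate"]["ui:hasLabel"])  # → Date of birth
```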
Results: The results show that both kinds of description can be mapped to the same core model and bear the same expressive power. The implementation showed that the proposed approach of using ontologies to describe Interview Applications leads to the same results as the solution using the proprietary DSL of [8]. Although this is no formal proof, the result clearly indicates that the data-centric approach can be applied to ontological descriptions of Interview Applications.
Using Application Ontologies for the Automatic Generation

Ontology-Based, Shareable and Reusable Application Descriptions (3). To assess the suitability of the proposed ontology as a shareable, reusable application description, we applied the approach to a concept for distributed marketspaces working with generic UIs for the specification of complex products. This work has already been published in [10] and is thus only summarized here. The objective was to show that application ontologies as proposed above (1) can be shared and used to generically build composed UIs and (2) can be used for non-UI-specific purposes, in this case to deduce a complex product request from the user input, i.e., an instance of the application's data model represented by the ontology.
For this purpose, a demonstrator was implemented that used the aforementioned results. Figure 5a shows its basic architecture. As a generic user frontend, a Complex Product Builder (CPB) application was implemented that lets the user search and select Application Ontologies (AOs) as proposed in this paper (Fig. 5a, ❶). These are drawn from a shared UI description repository containing arbitrary UIOs for different product components (e.g. the booking of a concert or a flight). The AOs selected for the user's demand are sent to the Transformation Service (Fig. 5, ❷), which returns the generated UI partials for each AO. These are aggregated into a UI presented to the user (Fig. 5b). Since the UI partials are generated from the elements contained in the AO, the user input clearly relates to the corresponding ontology elements. This allows an Ontology Mapper (Fig. 5, ❸) to build an instance model for each presented AO containing the user's input data. The result is a set of ontology instances on which a reasoner can build inferences and derive a complex product request, which can be sent to the Distributed Market Space for further processing (i.e. generating a quote/proposal for the requested product components).
Fig. 5. Generic UIs for complex product requests: (a) basic architecture (modified from [10]); (b) sample UI for a travel booking
Results: Although the demonstrator is still a proof of concept, it already showed that it is possible to share application descriptions and thus provide generic UIs based on Application Ontologies. AOs can be assembled from arbitrary sources (e.g. topic-related repositories for insurance, travel planning, etc.), and the UIs can be derived and aggregated automatically based on the contained information. As a second result, the demonstrator showed that these ontological descriptions can be used for non-UI-specific purposes, such as reasoning on instances of the AOs and thus further processing in backend systems that understand the used ontologies.
6 Related Work
Research on the automatic generation of UIs comprises many contributions from recent years that are based on model-driven concepts. Several approaches focus on different aspects of UI generation.
User Interface Description Languages (UIDLs) focus mainly on the description of concrete UIs in a technology-independent way. Examples are JavaFX³, UIML [1], UsiXML [13] and XForms⁴. The essential idea is to model dialogs and forms using technology-independent descriptions of input/output controls together with relations between elements and behavior (e.g. visibility) within a concrete UI.
Task-/conversation-based approaches describe applications by dialog flows which are derived from task models, e.g. CAP3 [23], MARIA [16] and the conversation-based approaches by Popp et al. [19]. They focus on a concrete model of the dialog flows and their variants. To generate an application frontend, the steps in a dialog flow are associated with technology-independent UI descriptions displayed to the user.
Data-centric approaches can be found in JANUS [2] and Mecano [20], which use a domain model as the starting point for the derivation of UIs. While JANUS is designed to provide CRUD-like interfaces working on a persisted domain model and does not support much dynamics in the UI, Mecano adds these aspects to its description.
Existing ontology-based approaches generally rely on the concepts of the aforementioned approaches and use ontologies to represent the information about concrete UIs. For instance, in analogy to UIDL approaches, Liu et al. [24] propose an ontology-driven framework to describe UIs based on concepts stored in a knowledge base. Khushraj and Lassila [12] use web service descriptions to derive UI descriptions based on a UI ontology, adding UI-related information to the concept descriptions (profile). In analogy with task-based approaches, Gaulke and Ziegler [7] use a profiled domain model enriched with UI-related data to describe a UI and associate it with an ontology-driven task model. ActiveRaUL [21] combines a UIDL with a data-centric approach and makes a significant contribution to the generation of UIs based on arbitrary ontologies. The authors derive a hierarchical presentation of an ontology and map it to an ontological UI description. Since not much semantic information is contained, the resulting UIs are still rather simple and not very feature-rich regarding the supported interactions.

³ I. Fedortsova et al. 2014: http://docs.oracle.com/javase/8/javafx/fxml-tutorial/preface.html.
⁴ M. Dubinko et al. 2003: http://www.w3.org/TR/xforms.
Dissociation: A main goal of the proposed data-centric approach is to minimize the number of needed artefacts and to use a representation that can be reused for different purposes. The models of the aforementioned approaches usually do not contain enough semantic information for the reasoning that could be used to derive UI variants. Their UIs are modeled manually using a large number of artefacts, which leaves a gap in automating the process of building UIs. In addition, the produced artefacts are usually proprietary and UI-specific, which impedes their reuse for other purposes related to application generation.
The solution proposed in this paper is based on the application's processed data and enriches its model with additional semantics. This leads to a single, central description of the application that serves as a knowledge base for the automatic derivation of UI variants. The data-centric approach allows reuse of the model in different contexts and, by using a non-proprietary representation for the model, its sharing and integration into different environments. Although this approach is restricted to Interview Applications, it allows a significantly simplified modelling process, since the results can be derived from a single source.
7 Conclusion
In this paper, a data-centric, model-driven approach for the automatic generation of user interfaces for dialog-based Interview Applications is presented. The approach is based on a UI-agnostic, data-centric description of applications. The foundation is a model of the processed application data, which is enhanced by type-related, structural, and behavioral information to yield automatically generated UI variants, as demonstrated in the previous sections. The information needs are identified and a meta-model is derived from which the UIs can be inferred. Furthermore, the information needs are mapped to an ontological description relying on RDF/OWL constructs to obtain a non-proprietary representation of that information that can be used in different contexts.
A process to derive UIs from such a data-centric model is outlined. Finally, the evaluation is presented, which (1) provides an implementation of the generation process for UIs from data-centric application descriptions and is used as a proof of concept regarding (2) the usefulness of the ontological descriptions for UI generation and (3) their viability as a sharable, non-proprietary means for generating UIs for data-driven applications.
The results of the evaluation indicate that using a data-centric model is feasible for UI generation in the case of Interview Applications. Since the number of artefacts is reduced to a single, UI-agnostic application model, the step of declaring UIs manually can be eliminated. Because of its data-centric nature, the model can also be used for non-UI-specific tasks. Using a universal representation such as RDF/OWL adds even more value, as the application model is sharable and the contained semantics can be exploited by standard tools for reasoning on the model and its instances.
The approach is intentionally restricted to dialog-based Interview Applications, which are very important and frequently used in EIS, e.g. in the insurance domain. Since a limited set of applications was used for the analysis, we cannot claim completeness of the identified interaction patterns; the practical use of the approach will bring forth additional interaction patterns extending the basic information set in the future. Regarding the proposed use of ontologies, the evaluation strongly indicates their usefulness for UI derivation, although the approach is restricted to hierarchical structures and uses proprietary annotations, which limits its universality. Future work might concentrate on finding more general ways of incorporating this information and on exploiting existing approaches in order to apply the approach to arbitrary ontologies.
References
1. Abrams, M., Phanouriou, C., Batongbacal, A.L., Williams, S.M., Shuster, J.E.: UIML: an
appliance-independent XML user interface language. In: WWW 1999 Proceedings of the
Eighth International Conference on World Wide Web, pp. 1695–1708 (1999)
2. Balzert, H., Hofmann, F., Kruschinski, V.: The JANUS application development environment—generating more than the user interface. Comput. Aided Des. User Interfaces 96,
183–206 (1996)
3. Calvary, G., Coutaz, J., Thevenin, D., Limbourg, Q., Bouillon, L., Vanderdonckt, J.:
The CAMELEON Reference Framework (2002)
4. Chlebek, P.: User Interface-orientierte Softwarearchitektur. Vieweg & Sohn Verlag,
Wiesbaden (2006)
5. Constantine, L.L., Lockwood, L.A.D.: Software for Use: a Practical Guide to the Models and
Methods of Usage-Centered Design. ACM Press/Addison-Wesley Publishing Co.,
New York (1999)
6. Coutaz, J.: User interface plasticity: model driven engineering to the limit! In: EICS 2010
Proceedings of the 2nd ACM SIGCHI Symposium on Engineering Interactive Computing
Systems, pp. 1–8 (2010)
7. Gaulke, W., Ziegler, J.: Using profiled ontologies to leverage model driven user interface
generation. In: Proceedings of 7th ACM SIGCHI Symposium on Engineering Interactive
Computing Systems, EICS 2015, pp. 254–259 (2015)
8. Hitz, M.: mimesis: Ein datenzentrierter Ansatz zur Modellierung von Varianten für
Interview-Anwendungen. In: Nissen, V., Stelzer, D., Straßburger, S., Fischer, D. (eds.)
Proceedings - Multikonferenz Wirtschaftsinformatik (MKWI) 2016, pp. 1155–1165 (2016)
9. Hitz, M.: Interner Projektbericht zu mimesis.ui., DHBW-Stuttgart (2013)
10. Hitz, M., Radonjic-Simic, M., Reichwald, J., Pfisterer, D.: Generic UIs for requesting complex
products within distributed market spaces in the internet of everything. In: Buccafurri, F.,
Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-ARES 2016. LNCS, vol. 9817,
pp. 29–44. Springer, Heidelberg (2016). doi:10.1007/978-3-319-45507-5_3
11. Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S.: OWL 2 Web
Ontology Language Primer. http://www.w3.org/TR/2009/REC-owl2-primer-20091027/
12. Khushraj, D., Lassila, O.: Ontological approach to generating personalized user interfaces
for web services. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005.
LNCS, vol. 3729, pp. 916–927. Springer, Heidelberg (2005). doi:10.1007/11574620_65
13. Limbourg, Q.: USIXML: a user interface description language supporting multiple levels of
independence. In: Matera, M., Comai, S. (eds.) ICWE Workshops, pp. 325–338. Rinton
Press (2004)
14. Meixner, G., Paternò, F., Vanderdonckt, J.: Past, present, and future of model-based user
interface development. i-com Zeitschrift für interaktive und kooperative Medien 10(3), 2–11
15. Miguel, A., Faria, J.P.: Automatic generation of user interface models and prototypes from
domain and use case models. In: Matrai, R. (ed.) User Interfaces. InTech (2010)
16. Paterno, F., Santoro, C., Spano, L.D.: MARIA: a universal, declarative, multiple
abstraction-level language for service-oriented applications in ubiquitous environment.
ACM Trans. Comput. Interact. 16, Article No. 19 (2009)
17. Peffers, K., Rothenberger, M., Tuunanen, T., Vaezi, R.: Design science research evaluation.
In: Peffers, K., Rothenberger, M., Kuechler, B. (eds.) DESRIST 2012. LNCS, vol. 7286,
pp. 398–410. Springer, Heidelberg (2012). doi:10.1007/978-3-642-29863-9_29
18. Peffers, K., Tuunanen, T., Gengler, C.E., Rossi, M., Hui, W., Virtanen, V., Bragge, J.: The
design science research process: a model for producing and presenting information systems
research. In: Proceedings of Design Science Research in Information Systems and
Technology DESRIST 2006, vol. 24, pp. 83–106 (2006)
19. Popp, R., Falb, J., Arnautovic, E., Kaindl, H., Kavaldjian, S., Ertl, D., Horacek, H., Bogdan,
C.: Automatic generation of the behavior of a user interface from a high-level discourse
model. In: Proceedings of the 42nd Annual Hawaii International Conference on System
Sciences, HICSS (2009)
20. Puerta, A.R., Eriksson, H., Gennari, J.H., Musen, M.A.: Beyond data models for automated
user interface generation. In: Proceedings British HCI 1994 (1994)
21. Sahar, A., Armin, B., Shepherd, H., Lexing, L.: ActiveRaUL: automatically generated web
interfaces for creating RDF data. In: Proceedings of the 12th International Semantic Web
Conference, ISWC 2013, vol. 1035, pp. 117–120 (2013)
22. Sein, M.K., Henfridsson, O., Rossi, M.: Action design research. MIS Q. 35, 1–20 (2011)
23. Van den Bergh, J., Luyten, K., Coninx, K.: CAP3: context-sensitive abstract user interface
specification. In: Proceedings of the 3rd ACM SIGCHI Symposium on Engineering
Interactive Computing Systems - EICS 2011, pp. 31–40 (2011)
24. Liu, B., Chen, H., He, W.: Deriving user interface from ontologies: a model-based approach.
In: Proceedings of International Conference on Tools with Artificial Intelligence ICTAI
2005, pp. 254–259 (2005)
Semantic-Based Recommendation Method
for Sport News Aggregation System
Quang-Minh Nguyen, Thanh-Tam Nguyen, and Tuan-Dung Cao
Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi, Vietnam
[email protected], [email protected],
[email protected]
Abstract. News on the Internet today plays an important role in helping people access daily information from around the world. News aggregators are websites that collect and provide content from different sources in one location for easy viewing. However, the increasing amount of news on the Internet makes it difficult for readers to access the news they are interested in. One solution to this issue is to employ recommender systems. In this research, we propose a novel method for news recommendation based on a combination of semantic similarity and content similarity between news items, and implement it as a feature of the semantic-based news aggregator BKSport. Experimental results show that a combination of both kinds of similarity measures yields better recommendations than either measure used separately.
1 Introduction
The development of the Internet has brought a sharp increase in the number of news websites, and the Web has become a popular platform for broadcasting news. News aggregators are websites that collect news from various sources and provide an aggregated view of the events taking place all over the world. Unfortunately, a critical issue of news aggregation systems is that the large number of daily published news items obstructs readers when they want to find the ones relevant to their particular interests. A possible solution to this problem is the use of recommender systems, as they can traverse the space of choices and predict the potential usefulness of news for each reader.

There has been much research on news recommendation methods based on a certain similarity measure: either similarity between news items, known as Global Recommendation Systems (GRS), or similarity between the personal interests of readers and news, known as Personal Recommendation Systems (PRS) [2, 5]. In GRS, the recommended news items are those with the highest similarity to the news the reader is currently reading. In PRS, on the other hand, the recommended news items are those with the highest similarity to the personal interests of the reader, which are modeled based on the history of posts the reader has read. Collaborative filtering (CF) is a widely applied technology in PRS development. With the explosion of news on the Web, designing novel approaches for effective news recommendation, suggesting news closer and more relevant to readers, is still a matter of concern. In this research, we focus on proposing a news recommendation method following the global recommendation system model, enhancing results from existing works.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 32–47, 2016.
DOI: 10.1007/978-3-319-49944-4_3
The most important task in developing a GRS is to build a model to calculate the similarity between news items. Recent research on news similarity measurement centers on two prominent approaches: content-based similarity and semantic-based similarity. In the content-based approach, the similarity of news items is calculated based on statistics of the vocabulary appearing in their content, and almost all recommended news items focus only on the subject the target news item is about. In contrast, in the semantic-based approach [1], the similarity of news items is usually based on an available knowledge base used to exploit semantic relationships between elements appearing in the news. Therefore, recommended news will likely cover a wider range of subjects than in the content-based approach. Both approaches have weaknesses which limit their effectiveness in news recommendation. Our approach is a hybrid one in the sense that it combines content-based and semantic-based recommendation. Concretely, the similarity of news items is a linear combination of content-based similarity and semantic-based similarity. The experimental results indicate that this combination yields more effective recommendations than either measure used separately.
This work is part of the development of the news aggregation system BKSport [11], which is based on Semantic Web technology and aims to effectively handle the amount of sports news gathered from various sources on the Internet. It therefore inherits results obtained in our previous research, such as an ontology and knowledge base in the sports domain and methods for named entity recognition and for extracting semantic relationships between entities in the news.
The rest of the paper is organized as follows. Section 2 describes previous work related to measuring semantic similarity between news items. Section 3 presents our proposed method in more detail. In Sect. 4, we present the experiments and the evaluation performed using the implementation of the proposed recommender. Finally, advantages and disadvantages of this method, as well as corrective measures and future research lines, are discussed in Sect. 5.
2 Related Work
Traditionally, many content-based recommenders [7, 9] use term extraction methods like TF-IDF (Term Frequency-Inverse Document Frequency [10]) in conjunction with the cosine similarity measure in order to compare the similarity between two documents. TF-IDF measures the importance of a word in a document based on its frequency of occurrence in the entire document dataset (or corpus). After calculating the TF-IDF value for each word in a document, this metric is combined with the cosine or Jaccard measure to calculate the similarity between two documents.
The TF-IDF value of a word appearing in a document is calculated by the following formula:

$$\mathrm{TFIDF}_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_{i}$$

in which:

$$\mathrm{TF}_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}} \qquad \text{and} \qquad \mathrm{IDF}_{i} = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

Q.-M. Nguyen et al.

where $n_{ij}$ is the number of occurrences of word $i$ in document $j$ and $|D|$ is the total number of documents in the dataset.
Then, each document is represented as an $N$-dimensional vector $V_i$ (where $N$ is the size of the dictionary); the value of each element of the vector is the TF-IDF value of the corresponding word. If a word of the dictionary does not appear in the news item, the value of the corresponding vector element is 0.
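The TF-IDF vectorization and cosine comparison described above can be sketched as follows (toy corpus, simplified whitespace tokenization):

```python
import math

# Toy corpus: each document is a short news headline.
docs = [
    "messi scores for barcelona",
    "suarez scores for barcelona",
    "benzema plays for real madrid",
]
corpus = [d.split() for d in docs]
vocab = sorted({w for d in corpus for w in d})

def tf_idf_vector(doc):
    """N-dimensional TF-IDF vector over the dictionary (N = len(vocab))."""
    vec = []
    for w in vocab:
        tf = doc.count(w) / len(doc)              # TF_ij = n_ij / sum_k n_kj
        df = sum(1 for d in corpus if w in d)     # |{d : t_i in d}|
        idf = math.log(len(corpus) / df)          # IDF_i = log(|D| / df)
        vec.append(tf * idf)                      # 0 for absent words
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v = [tf_idf_vector(d) for d in corpus]
# The two Barcelona items are more similar to each other than to the third.
print(cosine(v[0], v[1]) > cosine(v[0], v[2]))
```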
In the semantic-based approach, previous studies have exploited the relationships between the components of news items to calculate semantic similarity. In the study carried out by Batet et al. [4], a measure based on exploiting the taxonomical structure of a biomedical ontology is proposed for determining the semantic similarity between word pairs. The method proposed by Capelle et al. [6] exploits the similarity between components (words or named entities) of news items to calculate the similarity between two news items. To measure the similarity between two components, their method relies on:

– the WordNet dictionary tree when the components are words, denoted by sim_SS;
– the PMI measure when the components are named entities, denoted by sim_Bing; this measure relates to the statistical frequency of occurrence of components and their co-occurrence.
The final formula combines the two measures sim_Bing and sim_SS to calculate the semantic similarity between two news items as follows (α is a correction parameter):

$$\mathrm{sim}_{BingSS} = \alpha \cdot \mathrm{sim}_{Bing} + (1 - \alpha) \cdot \mathrm{sim}_{SS}$$
Also exploiting the relationships between the components of two news items, Frasincar et al. [8] presented a number of news recommendation methods in the semantic-based approach. Similar to Capelle et al. [6], their work aims at a personalized recommendation system. However, the user profile of the reader is also built based on the news the reader has read, and calculating the similarity between a user profile and a news item works the same way as calculating the similarity between two news items. The methods presented in that research used an ontology and knowledge base to exploit semantic relationships between concepts, which are classes in the ontology. Experiments showed that Ranked Semantic Recommendation 2 is the most effective among them. However, it retains certain limitations, which we point out in the following parts together with methods to overcome them.
3 Similarity Between News Items
There are two main approaches to calculating the similarity between text news items: content-based and semantic-based. Each approach has its own advantages and disadvantages. We aim to combine these two approaches by combining a content-based similarity measure and a semantic-based similarity measure, with the expectation of overcoming the limitations of each approach and making recommendation more effective.
3.1 Semantic-Based Similarity
To calculate semantic similarity, we exploit mutual semantic relations between components of news items. These relations are determined based on the ontology and knowledge base that we have built. We extract and analyze components of the news items, including entities, types of entities, and semantic annotations. The next sections present how these components are exploited in calculating the semantic similarity between news items.
3.1.1 Semantic Relation Between Entities
Specifically, in order to exploit relations between entities for calculating the similarity between news items, we extend the Ranked Semantic Recommendation 2 method proposed by Frasincar et al. [8]. In this method, the authors also used an ontology and knowledge base to exploit the relations between entities. However, the method retains some limitations:

– It only considers direct relations between entities, without considering indirect ones.
– It does not consider the importance of entities as they appear in various positions in the news item (title, description, etc.).

To overcome these limitations, we first present a method to calculate the relation weight between entities based on the ontology and knowledge base. In addition, we combine this with a statistical method based on the co-occurrence of entities in the same news items when determining the relation weight between entities. Finally, we present the method that uses these relation weights to determine the semantic similarity between news items.

Relation Weight Between Entities Based on Ontology and Knowledge Base
Aleman-Meza et al. [3] presented methods to calculate the rank of a Semantic Association based on the Semantic Path between two entities in order to determine the relation weight between entities. Specifically, they define Semantic Association and Semantic Path as follows:

Definition: if two entities $e_1$ and $e_n$ can be connected by one or more sequences $e_1, P_1, e_2, P_2, e_3, P_3, \ldots, e_{n-1}, P_{n-1}, e_n$ in an RDF graph, where the $e_i$ ($1 \le i \le n$) are entities and the $P_j$ ($1 \le j \le n-1$) are relations in the ontology, then we say there exists a semantic relation between $e_1$ and $e_n$. The sequence $e_1, P_1, e_2, P_2, e_3, P_3, \ldots, e_{n-1}, P_{n-1}, e_n$ is a Semantic Path.
For example, in the knowledge base we have:

– <Lionel-Messi> <playFor> <Barcelona-FC>.
– <Luis-Suarez> <playFor> <Barcelona-FC>.

Then there exists a semantic path between the two entities Lionel Messi and Luis Suarez as follows:

<Lionel-Messi> → <playFor> → <Barcelona-FC> ← <playFor> ← <Luis-Suarez>
As a result, there exists a semantic relation between Lionel Messi and Luis Suarez.
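The path-finding idea can be illustrated with a small breadth-first search over an RDF-like graph in which edges may be traversed in both directions, as in the Messi/Suarez example above. The triples are toy data, not the BKSport knowledge base:

```python
from collections import deque

# Toy knowledge base as subject-predicate-object triples.
triples = [
    ("Lionel-Messi", "playFor", "Barcelona-FC"),
    ("Luis-Suarez", "playFor", "Barcelona-FC"),
    ("Barcelona-FC", "competeIn", "La-Liga"),
]

def semantic_path(start, goal):
    """BFS returning the shortest alternating entity/relation sequence."""
    neighbours = {}
    for s, p, o in triples:
        neighbours.setdefault(s, []).append((p, o))
        neighbours.setdefault(o, []).append((p, s))   # traverse edges backwards too
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for pred, nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [pred, nxt])
    return None

print(semantic_path("Lionel-Messi", "Luis-Suarez"))
# → ['Lionel-Messi', 'playFor', 'Barcelona-FC', 'playFor', 'Luis-Suarez']
```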
Based on the properties of a semantic path, we compute a path rank value expressing the relation weight between the two entities at the ends of the path. Because there might be multiple semantic paths between two entities, we take the highest path rank value to represent the relation weight. Aleman-Meza et al. [3] used four characteristics of a semantic path to calculate the path rank, corresponding to the following four weights:

– Subsumption Weight: based on the structure of the ontology, a component weight is determined for each component (predicate and entity) in the path, from which a weight for the whole path is calculated.
– Path Length Weight: based on the length of the path.
– Context Weight: based on determining which region of the ontology each component of the path belongs to; each region in the ontology has a separate weight depending on the user's interests.
– Trust Weight: based on the weights of the properties in the ontology.

Applying this to news recommendation in football, we found Path Length Weight and Trust Weight to be the two meaningful and appropriate weights. For this reason, we only use these two weights to determine the path rank of a semantic path.
Path Length Weight. The length of a semantic path $e_1, P_1, e_2, P_2, \ldots, e_{n-1}, P_{n-1}, e_n$ is the number of entities and relations in the path (excluding $e_1$ and $e_n$). When two entities are only indirectly related, the more intermediate entities and relations lie between them, the lower the similarity between the two entities. Consequently, the path rank of a semantic path must be inversely proportional to the length of that path.

The Path Length Weight is defined in [3] as:

$$W_{length} = \frac{1}{length_{path}}$$

in which $length_{path}$ is the length of the semantic path.

For example, we have two semantic paths:

– $P_1$: <Lionel-Messi> → <playFor> → <Barcelona-FC> → <competeIn> → <La-Liga> ← <competeIn> ← <Real-Madrid> ← <playFor> ← <Karim-Benzema>
– $P_2$: <Lionel-Messi> → <playFor> → <Barcelona-FC> ← <playFor> ← <Luis-Suarez>

$P_1$ has length 7, so we obtain $W_{length}(P_1) = 1/7$; $P_2$ has length 3, so we obtain $W_{length}(P_2) = 1/3$.
From this, we can see that the similarity between Lionel Messi and Luis Suarez is higher than that between Lionel Messi and Karim Benzema.
Path Relation Weight. There are many different relations defined in the ontology. Each relation represents a different meaning and therefore also a different relation weight between entities. Some relations express a close association, others a loose one. For example, we have two triples in the knowledge base:

– <Luis-Enrique> <managerOf> <Barcelona-FC>.
– <Luis-Suarez> <playFor> <Barcelona-FC>.

Here there exist two relations, <managerOf> and <playFor>. The relation <managerOf> expresses a closer association than <playFor>, because each team has only a single manager at a certain time but may have many players. Therefore, we assign <managerOf> a higher weight than <playFor>, and from the above triples we conclude that <Barcelona-FC> has a higher similarity with <Luis-Enrique> than with <Luis-Suarez>.
The weight of a relation lies in the range $(0, 1]$. The Path Relation Weight of an overall path $P$ is defined in [3] from the weights $w(P_i)$ of the relations along the path, taken as their product:

$$W_{predicate} = \prod_{P_i \in P} w(P_i)$$
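For the example path $P_2$ above, the two weights can be computed as in this sketch. The numeric relation weights are illustrative values, and the product form of W_predicate reflects our reading of the trust weight of [3] rather than a formula stated explicitly in the text:

```python
# Illustrative relation weights (not values from the paper).
RELATION_WEIGHT = {"playFor": 0.6, "managerOf": 0.9, "competeIn": 0.5}

def w_length(path):
    """Path Length Weight: 1 / length, excluding the two end entities."""
    return 1.0 / (len(path) - 2)

def w_predicate(path):
    """Product of the relation weights along the path (assumed form)."""
    w = 1.0
    for rel in path[1::2]:          # relations sit at the odd positions
        w *= RELATION_WEIGHT[rel]
    return w

p2 = ["Lionel-Messi", "playFor", "Barcelona-FC", "playFor", "Luis-Suarez"]
print(round(w_length(p2), 3), round(w_predicate(p2), 3))  # → 0.333 0.36
```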
Relation Weight Between Two Entities Based on Ontology and Knowledge Base. Combining the two weights $W_{length}$ and $W_{predicate}$ with a pair of coefficients $a_{wl}$ and $a_{wp}$, we define the path rank of a semantic path as:

$$W_{path} = \frac{W_{length} \cdot a_{wl} + W_{predicate} \cdot a_{wp}}{a_{wl} + a_{wp}}$$

The value $W_{path}$ in the above formula is also the similarity value between the two entities based on the ontology and knowledge base.

Relation Weight Between Entities Based on Statistics of Co-occurrence in the Same News Items
According to the idea of the Capelle et al. on PMI measure [6], if two entities co-occur
in the same news items many times; these two entities have high similarity to each
other. We count co-occurrence of named entity pairs in a dataset on football news to
calculate weights PMI. The formula is defined as below:
cðe1 ;e2 Þ
cðe2 Þ
WPMI ðe1 ; e2 Þ ¼ log cðe
Q.-M. Nguyen et al.
In which:
– N is the number of news items available in the dataset.
– c(e1, e2) is the number of news items in the dataset in which the two entities
e1 and e2 co-occur.
– c(e1) is the number of news items in the dataset containing entity e1, and c(e2)
is the number of news items in the dataset containing entity e2.
As such, for any entity pair we have two values for calculating relation weights:
the weight W_path (calculated from the semantic path) and the weight W_PMI
(calculated from the co-occurrence statistics of entity pairs). Before combining
these two weights with each other, we normalize each of them as:

w_new = (w_old − MIN) / (MAX − MIN)

In which MAX and MIN are, respectively, the maximum and minimum values among the
weights w.
Finally, we combine these two values with a pair of coefficients b_path and b_PMI
to calculate the similarity of each entity pair:

Similarity_entity(e1, e2) = (W_path · b_path + W_PMI · b_PMI) / (b_path + b_PMI)
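The normalization and combination steps above can be sketched as follows (a minimal illustration; the function names and the equal default coefficients b_path = b_PMI = 1 are our assumptions, not values from the paper):

```python
def minmax_normalize(values):
    """Scale a list of weights into [0, 1]: w_new = (w_old - MIN) / (MAX - MIN)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # degenerate case: all weights are equal
    return [(v - lo) / (hi - lo) for v in values]

def combine_weights(w_path, w_pmi, b_path=1.0, b_pmi=1.0):
    """Coefficient-weighted average of the two (already normalized) weights."""
    return (w_path * b_path + w_pmi * b_pmi) / (b_path + b_pmi)
```

With equal coefficients the combination reduces to a simple average of the two normalized weights.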
By convention, when e1 = e2 then Similarity_entity(e1, e2) = 1.

Method for Calculating Similarity Between News Items Based on Relations Between
Entities
First of all, we define the set of entities related to an entity r as the set
containing the entities whose similarity with r is greater than 0, denoted:

R(r) = {r1, r2, r3, ..., rn}
Suppose there is a news item A; the set of recognizable named entities in news
item A is denoted:

A = {a1, a2, a3, ..., am}
For each entity a_i in the set A, we build the set of entities related to a_i,
R(a_i) = {a_i1, a_i2, a_i3, ..., a_ik}. Grouping all sets R(a_i) together
(i = 1, ..., m), we obtain the set R of all entities not included in A but
related to A:

R = ( ⋃_{i=1..m} R(a_i) ) \ A
Semantic-Based Recommendation Method
Finally, we group the two sets A and R to obtain the set A_R, called the expansion
set of news item A:

A_R = A ∪ R
In the next step, we calculate a ranking value for each entity in the set A_R.
Each ranking value characterizes the relevance of the entity to news item A.
These ranking values should satisfy some properties:
– (1) The more times an entity appears in the news item, the greater that entity's
ranking value.
– (2) The more entities in the news item an entity is relevant to, the greater
that entity's ranking value.
– (3) The ranking value also depends on the appearance position of the entity in
the news item.
Regarding property (3), we distinguish the different positions in which an entity
can appear in the news item: title, description, bolder-text (bold text, image
titles, etc.) and content. We assign importance weights to these positions such
that:

W_title > W_description > W_boldertext > W_content
To calculate the ranking value of each entity in the set A_R, we follow the Ranked
Semantic Recommendation 2 technique [8] and represent the entities in a matrix
whose first row contains the entities of the set A_R and whose first column
contains the entities of the set A. In this matrix we calculate each value h_ij as:

h_ij = similarity(a_i, e_j) · WE(a_i)
In which WE(a_i) is the importance weight of the entity a_i in the news item. This
weight is calculated as follows. Suppose a_i is an entity appearing in the news
item, and N_title, N_description, N_boldertext, N_content are respectively the
numbers of occurrences of a_i in the title, description, bolder-text and content
of the news item. We define the importance weight of entity a_i as:

WE(a_i) = N_title · W_title + N_description · W_description
        + N_boldertext · W_boldertext + N_content · W_content
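The importance weight WE(a_i) can be sketched as below; the concrete position-weight values are illustrative assumptions that only respect the ordering W_title > W_description > W_boldertext > W_content:

```python
# Illustrative position weights (not taken from the paper); they only need to
# satisfy W_title > W_description > W_boldertext > W_content.
POSITION_WEIGHTS = {"title": 1.0, "description": 0.8, "boldertext": 0.6, "content": 0.4}

def importance_weight(counts):
    """WE(a_i): per-position occurrence counts weighted by position importance.

    counts -- e.g. {"title": 1, "content": 3} means the entity occurs once in
    the title and three times in the body content.
    """
    return sum(POSITION_WEIGHTS[pos] * n for pos, n in counts.items())

# One mention in the title outweighs two mentions in the body content here.
```

An entity appearing in the title therefore contributes more to its weight than the same entity appearing only in the body text.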
Finally, following the formula defined in [8], the ranking weight of each entity
e_j in the set A_R is calculated by:

Rank(e_j) = Σ_{a_i ∈ A} h_ij
Assume V_A is a vector containing the Rank(e_i) values calculated above. We
normalize the value of each element of V_A into the range [0, 1]. The
normalization formula is:

v_i = (v_i − MIN) / (MAX − MIN)

In which MAX and MIN are respectively the maximum and minimum values of the
elements in vector V_A. If MAX = MIN ≠ 0 then v_i = 1 for every value of i.
As a result, the steps above yield a vector for each news item. The final step is
calculating the similarity between any two news items based on their vectors.
Suppose we have two news items A, B and two corresponding vectors V_A, V_B.
Because these two vectors can have different numbers of dimensions, we define the
similarity between the two vectors V_A, V_B (which is also the similarity between
the two news items A and B) as a variation of cosine similarity:

similarity_basedentity(A, B) = cosine(V_A, V_B)
  = ( Σ_{e_a ∈ A, e_b ∈ B, e_a = e_b} v_a · v_b )
    / ( sqrt(Σ_{e_a ∈ A} v_a²) · sqrt(Σ_{e_b ∈ B} v_b²) )

In which v_a, v_b are respectively the values Rank(e_a), Rank(e_b) in the vectors
V_A, V_B.
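This variation of cosine similarity, in which only entities appearing in both news items contribute to the numerator, can be sketched as follows (a hypothetical implementation over sparse vectors keyed by entity name):

```python
import math

def cosine_over_shared(va: dict, vb: dict) -> float:
    """Variation of cosine similarity for vectors of different dimensions.

    va, vb map entity identifiers to Rank values; only entities present in
    BOTH vectors contribute to the numerator, while each norm is taken over
    the full vector.
    """
    num = sum(va[e] * vb[e] for e in va.keys() & vb.keys())
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    if na == 0 or nb == 0:
        return 0.0  # an empty vector has no meaningful similarity
    return num / (na * nb)
```

Two news items with no shared entities get similarity 0, and identical vectors get similarity 1.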
3.1.2 Types of Entities Appearing in the News Items
A reader who is interested in a subject is likely to also be interested in other
subjects of the same type. For example, a reader who is reading news about
football teams tends to continue reading other news items about football teams
rather than news items about players or stadiums. Therefore, if two news items
contain similar types of entities, the similarity of the two news items will be
higher (Fig. 1).
Fig. 1. An example of similarity between news based on types of entities in the news
In the ontology, each named entity defined in the knowledge base belongs to a
certain defined object class. These classes can be regarded as the types of the
entities. For example, the two entities Lionel Messi and Luis Suarez in the
knowledge base have the same type, because both belong to the class
FootballPlayer; however, neither has the same type as the entity Barcelona-FC,
because that entity belongs to the class FootballTeam.
Statistics of the entity types appearing in the news items are gathered similarly
to the statistics of entities; two different entities can be of the same type. The
appearance position of entities also affects the association weight between an
entity type and the corresponding news item. These weights are calculated from the
appearance frequency and appearance position of the entities of that type. Suppose
we calculate the association weight of entity type C for a news item A. Given that
c_i are the entities of class C appearing in news item A, we define the
association weight of entity type C with news item A as:

W_C(C) = Σ_i WE(c_i)
We build a vector for the news item with the W_C weights as elements, similar to
building the entity-based vector in Sect. 3.1.1. The elements of each vector are
normalized before using the variation of the vector-similarity formula from
Sect. 3.1.1. The resulting value is denoted similarity_basedtype.
3.1.3 Semantic Annotations of the News Items
Semantic annotations here are triplets of the form <subject> <predicate> <object>,
in which the subject and object are two entities. These semantic annotations also
play an important role because they partly represent the content that the news
item is about (Fig. 2).
Fig. 2. An example of similarity between news items based on semantic annotations of news
A news item may contain many triplets, and a triplet may appear several times.
Triplets that appear several times in the news item are important triplets,
showing the main contents that the news item mentions. Moreover, the appearance
position of these triplets in the news item also expresses their importance. The
importance of the positions in the news item (title, description, bolder-text,
content) is as presented in the previous section. The more triplets two news items
have in common, the higher their similarity.
For each triplet, let N_title, N_description, N_boldertext, N_content respectively
denote the numbers of occurrences of the triplet in the title, description,
bolder-text and content. We use the same formula as the one for calculating the
importance weight of entities in Sect. 3.1.1 to compute the importance weight W_T
of each triplet in the news item.
Then we represent these weights as the elements of a vector and use the vector
normalization formula to put them into the range [0, 1]. To calculate the
similarity between news items based on semantic annotations, we use the variation
of the cosine formula described in Sect. 3.1.1 to compute the distance between the
two vectors. This value is denoted similarity_basedannotation.
Thus, we use three parameters to determine semantic similarity between news
items, based on the following factors:
– Relations between named entities,
– Types of entities in the news items,
– Semantic annotations of the news items.
Each of these three parameters has a different significance in determining the
semantic similarity between news items. We combine the three parameters to
determine the final value of the semantic similarity between news items. To do so,
we use a set of three coefficients θ_entity, θ_annotation, θ_type to express the
importance of each of the above parameters. We define the final formula for the
semantic similarity between two news items as:

Similarity_semantic(A, B)
  = similarity_basedentity(A, B) · θ_entity
  + similarity_basedannotation(A, B) · θ_annotation
  + similarity_basedtype(A, B) · θ_type
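The three-way combination can be sketched as follows; the equal default values of the three θ coefficients are our assumption for illustration only:

```python
def semantic_similarity(s_entity: float, s_annotation: float, s_type: float,
                        h_entity: float = 1.0, h_annotation: float = 1.0,
                        h_type: float = 1.0) -> float:
    """Weighted combination of the three semantic similarity components.

    The theta (h_*) coefficients express the relative importance of the
    entity-based, annotation-based and type-based similarities.
    """
    return (s_entity * h_entity
            + s_annotation * h_annotation
            + s_type * h_type)
```

Raising one coefficient makes the corresponding component dominate the combined score.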
Content-Based Similarity
With a news recommendation method that uses only the semantic similarity proposed
above, we may encounter some problems:
– Insufficient or incorrect identification of named entities that appear in the news item.
– Insufficient semantic annotations of the news item.
These limitations are caused by the limited information in the ontology and
knowledge base. This is unavoidable, since the construction of the ontology and
knowledge base must be done manually or semi-automatically, which requires
substantial effort. Furthermore, the evolution of real-world knowledge, for
example when new players arrive or players change clubs, makes timely updates
difficult. To overcome these limitations, we combine the proposed semantic
similarity with the content similarity of the two news items.
In this section we describe the content-based similarity which is computed using
TF-IDF weight of words in the news item combined with cosine measure.
Words with high TF-IDF weight are often important words, showing main contents
of the news item. So, we are only interested in words with high TF-IDF weight. Steps
to build a set of important words of the news item include:
– Step 1: Eliminate stop words. Stop words are words that do not contribute to
representing the contents of the news, such as "a", "an", "the", etc.
– Step 2: Standardize words into their infinitive (base) form. Verbs and nouns
often occur in many different forms depending on the context, although they still
express the same meaning, for example "make", "makes" and "made". We therefore
change them into their base form.
– Step 3: Calculate TF-IDF for each word in the news item (after standardization
in Step 2).
– Step 4: Sort the words and select the top words with the highest TF-IDF
according to a defined threshold.
After the above steps, we obtain a set of words with the highest TF-IDF values. We
represent the news item as a vector whose elements v_k are the TF-IDF values of
the words in this set. The similarity between two news items A and B, with two
important-word sets S_A, S_B and two corresponding vectors V_A, V_B, is calculated
with a variation of the cosine formula:

Similarity_TFIDF(A, B)
  = ( Σ_{t_a ∈ S_A, t_b ∈ S_B, t_a = t_b} v_a · v_b )
    / ( sqrt(Σ_{t_a ∈ S_A} v_a²) · sqrt(Σ_{t_b ∈ S_B} v_b²) )
In which:
– t_a, t_b are corresponding words in the two sets S_A, S_B.
– v_a, v_b are the TF-IDF values of the words t_a, t_b.
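Steps 1–4 of the important-word pipeline can be sketched as below (a toy illustration: the stop-word list is a tiny sample, and the stemming of Step 2 is assumed to have been applied to the tokens already):

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "and"}  # illustrative subset only

def top_tfidf(doc_tokens, all_docs, k=5):
    """Drop stop words, score the remaining words by TF-IDF, keep the k best.

    doc_tokens -- token list of the news item (already base-form normalized)
    all_docs   -- list of token lists, used to compute document frequencies
    """
    n = len(all_docs)
    tokens = [t for t in doc_tokens if t not in STOP_WORDS]  # Step 1
    tf = Counter(tokens)                                     # Step 3: term freq
    scores = {}
    for term, f in tf.items():
        df = sum(1 for d in all_docs if term in d)           # document frequency
        idf = math.log(n / df) if df else 0.0
        scores[term] = f * idf
    # Step 4: keep the k highest-scoring words
    return dict(sorted(scores.items(), key=lambda kv: -kv[1])[:k])
```

Words that occur in every document get an IDF of zero and are effectively discarded.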
News Recommendation Algorithm with Combined Similarity
To combine the semantic similarity Similarity_semantic with the content similarity
Similarity_TFIDF of two news items, we use a pair of weights γ_semantic and
γ_content. We define the combination formula as:

Similarity_combined(A, B)
  = Similarity_semantic(A, B) · γ_semantic + Similarity_TFIDF(A, B) · γ_content
The news recommendation algorithm is as follows:
Input: a target news item A and a set C of N candidate news items.
Output: the set of K news items with the highest combined similarity with A.
– Step 1: Identify named entities, make semantic annotations for news item A and
candidate news items in set C.
– Step 2: Build set of words with the highest TF-IDF weight for news item A and
news items in set C.
– Step 3: With each news Ci in set C, take the following steps:
• Step 3.1: Calculate Similaritybasedentity ðA; Ci Þ
• Step 3.2: Calculate Similaritybasedannotation ðA; Ci Þ
• Step 3.3: Calculate Similaritybasedtype ðA; Ci Þ
• Step 3.4: Calculate Similarity_semantic(A, C_i) based on the results of steps
3.1, 3.2 and 3.3.
• Step 3.5: Calculate Similarity_TFIDF(A, C_i).
• Step 3.6: Calculate Similarity_combined(A, C_i) based on the results of steps
3.4 and 3.5.
– Step 4: Sort news items Ci in descending order according to value
Similaritycombined ðA; Ci Þ.
– Step 5: Take the K news items at the top of the list sorted in Step 4 and
recommend them for news item A.
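The scoring, sorting and top-K selection of steps 3–5 can be sketched as follows; the Jaccard measure here is only a toy stand-in for Similarity_combined:

```python
def recommend(target, candidates, similarity, k=5):
    """Score every candidate against the target, sort in descending order of
    similarity, and return the top k (steps 3-5 of the algorithm)."""
    scored = sorted(candidates, key=lambda c: similarity(target, c), reverse=True)
    return scored[:k]

# Toy similarity over entity sets, standing in for Similarity_combined:
jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))

news = [["messi", "barcelona"], ["madrid"], ["barcelona", "suarez"]]
top = recommend(["barcelona"], news, jaccard, k=2)
```

Because `similarity` is passed in as a callable, the same loop works for the semantic, content-based or combined measure.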
Assume that n_t is the average number of tokens in a news item and n is the number
of news items in the dataset C. In step 1, the complexity of named entity
recognition and semantic annotation of one news item is O(n_c · n_t), where n_c is
the total number of classes, entities and properties in the ontology and knowledge
base. Therefore, for the n news items in the set C plus the news item A, the time
complexity of step 1 is O(n · n_c · n_t). Step 2 transforms n + 1 news items into
TF-IDF vectors. As the IDF values of all tokens in the dictionary are computed
before running the algorithm, the time complexity of transforming one news item
into a TF-IDF vector equals the time complexity of calculating the TF values of
all tokens in that news item, O(n_t). Consequently the complexity of step 2 is
O(n · n_t). Step 3 is repeated n times, once for each element of C. The sub-steps
of step 3 are multiplications of pairs of vectors, so the time complexity of each
iteration is O(n_t) and the time complexity of step 3 is O(n · n_t). The time
complexity of the sorting in step 4 is O(n log n). As a result, the time
complexity of the proposed algorithm is O(n · n_c · n_t + n log n).
4 Experiment and Evaluation
4.1 Experiment Scenario
The goal of this section is to evaluate and compare the effectiveness of three
news recommendation methods:
– Only use semantic similarity between news items.
– Only use content similarity between news items.
– Combine both above similarities.
The evaluation of the different methods is performed by measuring precision.
Because we have not yet built an online system, we use an offline evaluation
method. For the offline evaluation, we choose N = 100 news items (symbolized as
the set A) from a number of well-known sports websites such as
http://www.skysports.com/, http://www.espnfcasia.com/ and
http://sports.yahoo.com/, and then ask collaborators to rate each news item as
relevant or non-relevant with respect to another one. After that, we have an
experimental dataset in which each news item A_i has K_Ai (0 ≤ K_Ai ≤ N − 1)
related news items and (N − 1 − K_Ai) unrelated news items. We run each of the
above methods separately for each news item A_i in the set A, generate the K_Ai
news items with the highest similarity to it, and compare them with the K_Ai news
items that the collaborators identified in the experimental dataset. For example,
consider the news item A_1 : collaborators discover 5
news items among the remaining 99 news items that are related to A_1; the
algorithm then automatically generates 5 corresponding news items, which are
compared with the 5 news items that the collaborators identified.
– TPAi is the number of news items that the algorithm precisely recommends for news
item Ai .
– FPAi is the number of news items that the algorithm imprecisely recommends for
news item Ai .
– FN_Ai is the number of related news items that the algorithm does not recommend
for news item A_i.
We define the precision for a news item A_i using the following formula:

precision(A_i) = TP_Ai / (TP_Ai + FP_Ai)
Following the way the evaluation is implemented, we obtain FP_Ai = FN_Ai and hence
precision(A_i) = recall(A_i). Therefore we only consider precision when evaluating
the above methods. Finally, we define the final precision of a method as the
average of the precisions over all N news items in the experimental dataset.
Precision(A) = ( Σ_{A_i ∈ A} precision(A_i) ) / N
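The evaluation metric can be sketched as below (standard precision; since FP_Ai = FN_Ai in this setup, the same value also equals recall):

```python
def precision(tp: int, fp: int) -> float:
    """precision(A_i) = TP / (TP + FP); with FP == FN this also equals recall."""
    return tp / (tp + fp) if tp + fp else 0.0

def overall_precision(per_item):
    """Final precision of a method: average over the N news items in the dataset."""
    return sum(per_item) / len(per_item)
```

For example, an item with 4 correct and 1 incorrect recommendations scores 0.8.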
4.2 Experiment Parameters
Certain parameters determine the importance of the components when these
components are combined. In this experiment, we set the parameter values entirely
based on our own judgment. For instance:
– The weights w_p of the relations in the ontology used to calculate W_path were
assigned based on our perception of the relevance of each relation:
w_managerOf = 0.8, w_playFor = 0.6, w_stadiumOf = 0.5, ...
– γ_semantic and γ_content are the two parameters used when combining the semantic
and content similarity measures between news items. As we consider the importance
of content similarity to be higher than that of semantic similarity in news
recommendation, we choose γ_semantic = 1, γ_content = 2.
4.3 Experiment Results and Evaluation
After running the three methods separately on the set A of 100 news items
according to the experiment scenario presented in Sect. 4.1, we obtain the
precision of each method, shown in Table 1.
Table 1. News recommendation precision in circumstances
Only use semantic similarity (semantic-based) 75.8%
Only use content similarity (content-based)
Combine both similarities (combined)
Assessment of Experiment Results. Table 1 indicates that, for the experimental
data A containing 100 news items, the semantic-based recommendation method is not
as precise as the content-based recommendation method, while combining the
content-based and semantic-based similarity methods brings the best results. This
can be explained as follows:
– When using only the semantic-based similarity (semantic-based approach), the
result depends mainly on the entities in the news items. Therefore, in some cases,
the algorithm recommends news items about the relevant entities but on a
completely different topic, which some collaborators deem irrelevant.
– Following the content-based approach, the recommended news item's topic is
usually quite close to that of the target news item. However, this method cannot
expand the topic. If we have two news items about the Barcelona club, where the
first is about the club's play and the second about the transfers of the club's
players, the content-based approach will determine that the similarity of these
news items is low.
– When combining the content-based and semantic-based similarities, the
recommendations overcome the limitations of each separate measure, leading to more
effective recommendation.
5 Conclusions and Future Work
In this research, we presented a recommendation method based on the combination of
the content-based and semantic-based similarity of news items. The semantic-based
measure is calculated from the semantic relations among objects. It enables the
recommendation not only of news items on a similar topic or news items revolving
around a key object of the target news item, but also of news items about other
objects that have a semantic relation with the objects in the target news item.
However, the semantic similarity measure focuses mainly on the entities and does
not consider the context mentioned in the news item. The content-based measure
overcomes this weakness of the semantic-based measure by extracting from the news
item the words with the highest TF-IDF values, which characterize the main context
mentioned in the news item.
We evaluated and compared the precision of the proposed method and of the
recommendation methods using only either measure separately. The experimental
results showed that the combination of the two similarities promotes the
effectiveness of both and overcomes the weaknesses of each individual method,
ultimately improving the recommendations. However, the proposed method retains
some limitations, such as its dependency on the adequacy of the knowledge base and
ontology. Determining the weights so that the combination of the measures achieves
the highest efficiency also remains a difficult open problem for the method.

References
1. Abdelrahman, A., Kayed, A.: A survey on semantic similarity measures between concepts in
health domain. Am. J. Comput. Math. 5, 204–214 (2015)
2. Ahn, J.W., Brusilovsky, P., Grady, J., He, D., Syn, S.Y.: Open user profiles for adaptive
news systems: help or harm? In: 16th International Conference on World Wide Web (WWW
2007), pp. 11–20. ACM (2007)
3. Aleman-Meza, B., Halaschek, C., Arpinar, I.B., Sheth, A.: Context-aware semantic
association ranking. In: Proceedings of the Semantic Web and Database Workshop, Berlin,
pp. 33–50
4. Batet, M., Sánchez, D., Valls, A.: An ontology-based measure to compute semantic
similarity in biomedicine. J. Biomed. Inform. 44, 118–125 (2011)
5. Billsus, D., Pazzani, M.J.: A personal news agent that talks, learns and explains. In: 3rd
Annual Conference on Autonomous Agents (AGENTS 1999), pp. 268–275. ACM (1999)
6. Capelle, M., Hogenboom, F., Hogenboom, A., Frasincar, F.: Semantic news
recommendation using WordNet and Bing similarities. In: Proceedings of the 28th
Annual ACM Symposium on Applied Computing, pp. 296–302
7. Elahi, A., Javanmard Alitappeh, R., Shokohi Rostami, A.: Improvement TFIDF for news
document using efficient similarity. Res. J. Appl. Sci. Eng. Technol. 4(19), 3592–3600
8. Frasincar, F., IJntema, W., Goossen, F., Hogenboom, F.: Ontology-based news recommendation. In: Proceedings of the 2010 EDBT/ICDT Workshops, Lausanne, Switzerland, 22–26
March 2010
9. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the 6th
New Zealand Computer Science Research Student Conference, Christchurch, New Zealand,
pp. 49–56, 14–18 April 2008
10. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process.
Manag. Int. J. Arch. 24(5), 513–523 (1988)
11. Tuan-Dung, C., Quang-Minh, N., Hoang-Cong, N., Hagino, T.: Towards efficient
sport data integration through semantic annotation. In: Proceedings of the Fourth
International Conference on Knowledge and Systems Engineering (KSE 2012), Da Nang,
Vietnam, pp. 99–106, August 2012. ISBN 978-1-4673-2171-6
Using SPEM to Analyze Open Data
Publication Methods
Jan Kučera(&) and Dušan Chlapek
University of Economics, Prague, Czech Republic
Abstract. Open Data is a current trend in sharing data on the Web. Public
sector bodies maintain large amounts of data that, if re-used, could be a source
of significant benefits. Therefore Open Government Data initiatives have been
launched in many countries in order to increase availability of openly licensed
and machine-readable government data. Because Open Data publishers face
various challenges, methods for publication of Open Data are emerging.
However, these methods differ in focus, scope and structure, which might
complicate the selection of a method that suits the specific needs of an
organization. In
this paper we discuss the possible benefits of constructing Open Data publication methods from a meta-model and we use the Software and Systems Process
Engineering Meta-Model version 2.0 to analyze similarities and differences in
structure of three Open Data publication methods.
Keywords: Analysis · Method · Open Data · Open Government Data ·
Software and Systems Process Engineering Meta-Model · SPEM
1 Introduction
Open Data is data “that can be freely used, re-used and redistributed by anyone –
subject only, at most, to the requirement to attribute and sharealike” [22]. Further
details on what “open” means are provided by the Open Definition [21]. Legal and
technical openness are the key aspects of ensuring reusability of data [19]. Legal
openness is achieved by open licensing of data, i.e. by making data available under a
license that permits its free re-use and redistribution. In order to minimize the technical
obstacles Open Data should be made available for free download as a complete dataset
in a machine-readable format.
Re-use of data held by public sector bodies could be a source of social and economic value [1]. Despite the fact that a number of countries have already launched their
Open Government Data initiatives, many important datasets remain closed [30].
Publishing Open Government Data could be a challenging task and publishers often
face various organizational, legal, technical and other barriers [11, 29].
In order to help the Open Data publishers to overcome the barriers and to promote
the recommended practices for its publication various Open Data publication methods
have been developed [23, 27, 28]. On one hand knowledge about how to open up data
is being gathered, on the other hand this knowledge is documented in different methods
and their heterogeneity might make integrating their content difficult. Zuiderwijk et al.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 48–58, 2016.
DOI: 10.1007/978-3-319-49944-4_4
[31] also point out that the Open Data publication process should be standardized
across an organization. Such a standardization requires sharing the information about
the Open Data publication process across the organization.
In the software engineering domain practitioners are also struggling with difficulties
in combining and integrating content about the development processes due to the
heterogeneity of the sources of this content and with providing the development teams
with an access to a shared body of information about the development process [18].
This situation led to development of the Software and Systems Process Engineering
Meta-Model (SPEM) – a conceptual framework and meta-model providing concepts
that allows “modeling, documenting, presenting, managing, interchanging, and
enacting development methods and processes” [18].
The goal of this paper is to discuss the possible benefits of constructing Open Data
publication methods from a meta-model and the possible benefits of use of SPEM 2.0
and to analyze similarities and differences in the structure of three Open Data publication methods using the SPEM 2.0 meta-model elements. Based on this analysis we
assess how the analyzed methods are constructed.
This paper is structured as follows. In the following section the Open Data publication
method is defined and examples of the existing methods are provided. Then the
potential benefits of constructing an Open Data publication method from a meta-model
in general and the benefits of using SPEM 2.0 in particular are discussed. Related work
is described in the next section. In the following section a short overview of the SPEM
2.0 meta-model elements is provided. Then the results of the structural analysis of the
three selected Open Data publication methods are presented. Conclusions are summarized at the end of this paper.
2 Open Data Publication Methods
Brinkkemper [3] provides definitions of the terms method, technique, tool and
methodology in the information systems development domain. He defines a method as
“an approach to perform a systems development project, based on a specific way of
thinking, consisting of directions and rules, structured in a systematic way in development activities with corresponding development products”, whereas he views a
methodology of information systems development as “scientific theory building about
methodical information systems development” [3]. He also points out that the term
methodology is sometimes used incorrectly standing for method.
We share the view of Brinkkemper that the term methodology should be used to
refer to the theory of methodical aspects of some particular field. Therefore we use the
term Open Data publication (ODP) method in this paper, which we broadly define as
an approach to the publication of Open Data consisting of recommendations about
what should be done or achieved when publishing Open Data, or how it should be
done.
A number of ODP methods have already been developed. For example, Project Open
Data [23] provides guidance, tools and case studies in order to help agencies in the USA
to implement the Open Data policy. Socrata, a provider of solutions for publication of
Open Data, also provides its own ODP method called “Open Data Field Guide” [28].
As of September 2016 a list of forty guides for implementation of the revised PSI
(Public Sector Information) Directive (Directive 2003/98/EC [7] amended by the
Directive 2013/37/EU [6]) and for publication of Open Data has been collected during
the Share-PSI 2.0 project [27]. This list contains both international as well as national
ODP methods of the European states. The national ODP methods are usually written in
the local language of the particular country and the list [27] also shows that they differ
in what practices for publication of Open Data and PSI are recommended by these
methods. These methods do not differ only in language and content but also in format
and structure. For example the Open Data Handbook of Flanders [9] represents a
document in PDF structured into chapters. On the other hand, the DCAT application
profile implementation guidelines [5] are represented in the form of web pages
with a common structure.
3 Benefits of Constructing Open Data Publication Methods
from a Meta-Model
Brinkkemper [3] introduced the term method engineering and he points out that
meta-modelling techniques are needed for design and evaluation of methods.
Gonzalez-Perez et al. [8] argue that software development methods constructed from a
meta-model “usually offer a higher degree of formalisation and better support for
consistent extension and customisation, since the concepts that make their foundations
are explicitly defined”.
Making data available for re-use requires adequate workflows [29]. These workflows
could be set up by implementing a suitable ODP method. However, as we indicated
with the examples of the existing ODP methods, these methods might differ in
scope, focus or structure, which might complicate the selection of a method that
suits the needs of a particular Open Data publisher, or finding compatible ODP
methods in situations where more than one method needs to be applied.
Explicit definition of the concepts that the ODP methods are built from could make
identification of the same or similar concepts across different ODP methods easier. This
in turn could help the Open Data publishers in assessing, selecting and customizing the
relevant ODP methods. Development and implementation of the ODP methods should
therefore benefit from use of meta-models.
Software and Systems Process Engineering Meta-Model [18] is an Object Management Group (OMG) specification. It tries to address some of the problems that
organizations face when developing systems such as lack of an easy access to a shared
body of information about the development process, difficulties in combining content
from different sources describing methods and practices due to their different presentation and style and difficulties in defining systematic development approach that fits
the specific needs of an organization. The primary focus of SPEM is software
development processes, but it also allows representing processes in other domains,
as demonstrated in the specification with a case study describing a process for
investment clubs [18].
Representing the Open Data publication methods as the SPEM method content and
processes could bring the Open Data practitioners the similar benefits as it brings to the
software development organizations. Possible benefits to the ODP methods resulting
from the key SPEM 2.0 capabilities are summarized in Table 1.
Table 1. Possible benefits of the use of SPEM 2.0 to the ODP methods (based on [18])

Key SPEM 2.0 capability: Separation of method content from the application of method content in a specific development process.
Possible benefit to the ODP methods: Method content related to publication of Open Data could be represented in a standardized way, independent of a particular process. This would allow its use in different Open Data publication processes, which in turn might help the sharing of good practice.

Key SPEM 2.0 capability: Consistent maintenance of different development processes.
Possible benefit to the ODP methods: Open Data publication processes could be systematically developed and maintained.

Key SPEM 2.0 capability: Ability to represent processes based on different lifecycle models and approaches.
Possible benefit to the ODP methods: Standardized ODP method content and processes could be configured for use in specific projects or environments, e.g. ODP processes could be configured to be in line with the approaches of different types of Open Data publishers.

Key SPEM 2.0 capability: Plug-in mechanism that enables processes to be extended or customized without modifying the original content.
Possible benefit to the ODP methods: Generally applicable recommendations for publication of Open Data could be extended or customized with specific guidelines, e.g. guidelines for publication of a specific category of data.

Key SPEM 2.0 capability: New processes could be assembled from reusable process patterns.
Possible benefit to the ODP methods: Process patterns for implementing the recommendations provided by an ODP method could be developed. Open Data practitioners following the given ODP method could re-use the patterns in their own processes, if appropriate.

Key SPEM 2.0 capability: Process components might be linked with inputs and outputs, while the development team could be allowed to choose the appropriate activities and techniques.
Possible benefit to the ODP methods: ODP methods could focus on the required or recommended outputs rather than the activities of the Open Data publication process. Open Data practitioners might be allowed to select the most appropriate activities or techniques for achieving the outputs, depending on the situation.
4 Related Work
Several authors discussed or used SPEM in various contexts. Bendraou et al. [2]
compared six UML-based languages for software process modeling including SPEM
1.1 and SPEM 2.0. Henderson-Sellers [10] analyzed differences in granularity and
ontologies of several standards including SPEM.
Martínez-Ruiz et al. [13] proposed an extension to SPEM that allows better modelling of software process variability. Rodríguez-Elias et al. [24] adapted SPEM for modelling and analysis of knowledge flows in software processes.
J. Kučera and D. Chlapek
Moraitis and Spanoudakis [15] present the Gaia2JADE process for multi-agent systems development, which is described using the SPEM specification. Other examples of SPEM use can be found in the work of Brusa et al. [4], where a process for building a public domain ontology is based on SPEM, and in the work of Loucopoulos and Kadir [12], where the BROOD (Business Rules-driven Object Oriented Design) process is represented using SPEM. Saldaña-Ramos et al. [25] proposed a competence model for testing teams and represented it using SPEM.
5 SPEM 2.0 Meta-Model Elements
A key feature of SPEM is the separation of method content definitions from their application in the development process [18]. Method content represents libraries of reusable content, such as definitions of tasks, roles, tools or work products, that are independent of their application in a specific step of a development lifecycle. In SPEM, a Process represents a specific way of performing some project, e.g. a software development project using a specific technology.
Separation of the reusable method content from the development processes allows defining various processes, with their own lifecycles and work breakdowns, that build upon the same base components providing recommendations about how to achieve common development goals. SPEM also reflects the fact that projects are unique and allows configuration of the method content and processes to fit the needs of a specific project.
SPEM provides meta-model classes as well as UML stereotypes (the SPEM 2.0 UML 2 Profile) for representing elements of both method content and processes [18].
According to [18] the key method content elements are Task Definitions, Work Product
Definitions, Role Definitions and Guidance.
Task Definition represents an assignable unit of work and it is assigned to specific
Role Definitions [18]. A Task Definition could be broken down into Steps. Work
Product Definition represents work products that are consumed, produced or modified
by Task Definitions. Role Definition is “a set of related skills, competencies, and
responsibilities of an individual or a set of individuals” [18]. Categories can be used to
categorize the content into logical groups such as requirements management.
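As an illustration, the method content elements described above can be sketched as a small data model. This is a simplified rendering of our own (the example names such as "Data Curator" are invented), not the OMG meta-model API:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoleDefinition:
    """A set of related skills, competencies and responsibilities."""
    name: str
    skills: list[str] = field(default_factory=list)


@dataclass
class WorkProductDefinition:
    """A work product consumed, produced or modified by tasks."""
    name: str


@dataclass
class TaskDefinition:
    """An assignable unit of work, optionally broken down into Steps."""
    name: str
    steps: list[str] = field(default_factory=list)
    performer: RoleDefinition | None = None
    inputs: list[WorkProductDefinition] = field(default_factory=list)
    outputs: list[WorkProductDefinition] = field(default_factory=list)


@dataclass
class Guidance:
    """Additional information classified with a Kind (e.g. Practice)."""
    kind: str
    text: str


# Reusable, process-independent method content:
curator = RoleDefinition("Data Curator", skills=["metadata management"])
plan = WorkProductDefinition("Open Data publication plan")
select = TaskDefinition(
    "Select datasets for publication",
    steps=["Inventory data sources", "Assess legal constraints"],
    performer=curator,
    outputs=[plan],
)
guide = Guidance("Practice", "Publish metadata in a machine-readable catalogue.")
```

Because none of these definitions refers to a lifecycle phase, the same content could be reused by any publication process that needs it.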
The key process elements are Activities and “use” elements for representing use of
the method content elements in the context of a specific process. Activity represents a
unit of work within a Process [18]. Activities can be nested to form breakdown
structures. Although the Process has a distinct symbol in SPEM 2.0, it is represented
by the Activity class in the SPEM 2.0 UML profile [18]. Therefore only the Activity is
taken into account in the analysis described in the following section.
Task Use, Role Use and Work Product Use are specializations of the abstract Method Content Use element, which represents the use of a particular method content element in the context of some Activity. The Method Content Use element ensures the separation of the method content from a process, and it allows overriding the method content elements with the specifics of the given process.
Role Use and Task Use instances are linked to the corresponding Activity instances with instances of the Process Performer, which can also be used to distinguish how a particular role is involved in the process, e.g. to present the RACI (responsible, accountable, consulted, informed) relationships [18]. Similarly, a Process Parameter links an Activity or a Task Use with a Work Product Use to indicate whether the Work Product Use is an input or an output of the Activity/Task Use, or both. However, in the SPEM 2.0 UML profile the Process Parameter instances are not represented as classes but as associations with the ParameterIn (input), ParameterOut (output) or ParameterInOut (input and output) stereotypes.
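The process-side elements can be sketched in the same spirit. Again this is our own simplified model, not the OMG API, and the example names ("Publish a dataset", "Data Owner") are invented:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Raci(Enum):
    """How a role is involved in an activity (RACI)."""
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


class Direction(Enum):
    """SPEM 2.0 UML profile parameter stereotypes."""
    IN = "ParameterIn"
    OUT = "ParameterOut"
    INOUT = "ParameterInOut"


@dataclass
class TaskUse:
    content_name: str   # name of the referenced Task Definition
    local_note: str = ""  # process-specific refinement/override


@dataclass
class ProcessPerformer:
    role_name: str
    involvement: Raci


@dataclass
class ProcessParameter:
    work_product_name: str
    direction: Direction


@dataclass
class Activity:
    """A unit of work within a Process; Activities can nest."""
    name: str
    tasks: list[TaskUse] = field(default_factory=list)
    performers: list[ProcessPerformer] = field(default_factory=list)
    parameters: list[ProcessParameter] = field(default_factory=list)
    children: list[Activity] = field(default_factory=list)


publish = Activity(
    "Publish a dataset",
    tasks=[TaskUse("Register dataset in catalogue")],
    performers=[
        ProcessPerformer("Data Curator", Raci.RESPONSIBLE),
        ProcessPerformer("Data Owner", Raci.ACCOUNTABLE),
    ],
    parameters=[ProcessParameter("Dataset metadata record", Direction.OUT)],
)
```

The TaskUse here only references the process-independent Task Definition by name, which is the separation SPEM relies on: the same method content can be wired into many differently structured processes.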
Additional information about both the method content and the process elements could be provided by Guidance. In order to distinguish various types of guidelines, Guidance can be classified with Kinds. The SPEM 2.0 specification [18] also contains a Base Plug-in which provides instances of Kinds for Guidance as well as for Activity, Category, Work Product Definition and Work Product Relationship.
6 Analyzing Open Data Publication Methods Using SPEM
In this section we use the SPEM 2.0 meta-model to analyze the structure of three Open Data publication methods. First, the analyzed ODP methods are briefly introduced; then the analysis approach is explained. Results of the analysis are discussed at the end of this section.
Analyzed Open Data Publication Methods
We selected three ODP methods in whose development we were involved because we
are familiar with their structure and semantics. The following methods were analyzed:
1. Best Practices for Sharing Public Sector Information (Share-PSI 2.0 Best Practices)
2. Methodology for publishing datasets as open data (COMSODE method)
3. Standards for publication and cataloguing of Open Data of the public sector in the
Czech Republic (Czech OGD standards)
The Best Practices for Sharing Public Sector Information [26] represent a lightweight approach focusing on providing guidance rather than a process. By contrast, the Methodology for publishing datasets as open data [16] represents a process-oriented approach to the publication of Open Data. Both the Share-PSI 2.0 Best Practices and the COMSODE method target an international audience and thus provide no recommendations specific to a particular region. The Czech OGD standards [14] represent a national ODP method that should be followed by public sector organizations in the Czech Republic.
Analysis Approach
None of the analyzed ODP methods is based on the SPEM meta-model. For each of these methods, SPEM 2.0 elements were identified that were considered appropriate, based on their semantics, to represent the content of the given ODP method. Elements for which stereotypes are defined and summarized in Annex A of the SPEM 2.0 specification [18] were considered in the analysis. If the content of the analyzed ODP methods was described or represented in a way that is independent of the process, appropriate SPEM method content elements were chosen. If it was not possible to separate the content from the process, e.g. in cases where the description referenced a particular part of the process, SPEM process elements were selected.
The Czech OGD standards are represented as a set of web pages. Sometimes one page contained both process-independent and process-dependent content. In such cases, more than one SPEM meta-model element was considered to represent the content.
Because all of the analyzed ODP methods contain guidance, we further analyzed
what kind of guidance is provided by mapping the provided guidance to the guidance
kinds specified in the SPEM 2.0 Base Plug-in.
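The tallying part of this mapping can be pictured with a short sketch. The guidance items and their assigned kinds below are invented placeholders for illustration, not quotations from the analyzed methods:

```python
from collections import Counter

# Hypothetical guidance items from an ODP method, each hand-mapped during
# the analysis to a SPEM 2.0 Base Plug-in guidance kind.
mapped_guidance = [
    ("Explanation of what Open Data is", "Concept"),
    ("Glossary entry: dataset", "Term Definition"),
    ("How to choose an open licence", "Practice"),
    ("How to register a dataset in the catalogue", "Tool Mentor"),
    ("Reference publication plan", "Template"),
]

# Count how often each guidance kind occurs in the method.
kinds_per_method = Counter(kind for _, kind in mapped_guidance)
print(sorted(kinds_per_method.items()))
```

Repeating this tally per method yields the comparison summarized in Table 3.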
SPEM 2.0 [18] also provides means for managing whole libraries of method content and processes, i.e. Method Plug-ins. However, this part of the SPEM 2.0 specification was not considered in the analysis because it focuses on the extensibility and variability mechanism rather than on the structure of the content.
Analysis Results
Table 2 provides an overview of the SPEM 2.0 meta-model elements considered suitable to represent the content of the analyzed ODP methods. SPEM 2.0 defines a broader set of elements than presented in Table 2; however, elements not present in the analyzed ODP methods are excluded from the overview.
Table 2. SPEM 2.0 meta-model elements applied in the analyzed ODP methods

SPEM 2.0 meta-model elements: Role Definition; Role Use; Task Definition; Task Use; Work Product Definition; Work Product Use
Methods compared: Share-PSI 2.0 Best Practices; COMSODE method; Czech OGD standards
The Share-PSI 2.0 Best Practices provide only method content, in the form of guidance. Therefore they can easily be referenced from other Open Data publication methods (for example, the Share-PSI 2.0 Best Practices are directly referenced from the Solutions Bank of the Open Data Handbook [20]). The Share-PSI 2.0 Best Practices are categorized according to a set of PSI elements [26], i.e. topics addressed by the PSI Directive. However, the semantics of this categorization corresponds to none of the SPEM 2.0 Base Plug-in category kinds (discipline, domain, role set and tool category).
Compared to the Share-PSI 2.0 Best Practices, the COMSODE method and the Czech OGD standards are constructed from a broader set of concepts: they not only specify what should be done in order to publish Open Data, but also who should be involved and what the expected outcomes are in terms of work products. Both define a process for publication of Open Data that is broken down into phases and activities. The COMSODE method also specifies the recommended steps for accomplishing the specified tasks.
The COMSODE method, especially in its Annex 2 [17], clearly separates elements such as activities, phases and performers (roles) and links them with relationships (for example, activities and performers are linked with responsibility relationships using a RACI chart).
The Czech OGD standards are also highly structured; however, the description of phases or individual activities sometimes presumes a certain sequence of work. The Czech OGD standards are intended to provide a recommended process to be followed within the Czech public administration, so their process orientation is in line with this purpose. However, extracting knowledge applicable in other contexts would require separating the method content from the process itself.
Table 3 summarizes kinds of guidance provided by the analyzed methods. The
following kinds of guidance were not identified in the analyzed methods: Checklist,
Estimate (metric kind), Estimation Considerations (metric kind), Estimating Metric
(metric kind), Example, Report, Reusable Asset, Supporting Material and Roadmap.
As the name suggests, practices are the main kind of guidance provided by the Share-PSI 2.0 Best Practices. However, external sources are referenced as well; these were classified as SPEM whitepapers. The COMSODE method explains the concept of Open Data and provides a glossary of terms as well as a wide range of practices for conducting the tasks and activities.
Table 3. Guidance kinds available in the analyzed ODP methods

SPEM 2.0 guidance kinds identified: Term Definition; Tool Mentor
Methods compared: Share-PSI 2.0 Best Practices; COMSODE method; Czech OGD standards
A Guideline provides “additional detail on how to perform a particular task or grouping of tasks” [18]. This additional detail on how Open Data should be published is provided by a reference internal directive that is part of the Czech OGD standards. The Czech OGD standards also include reference Open Data publication plans that can be used as templates. Guidance on how to register datasets in the Czech National Open Data Catalogue is provided as well, which represents the Tool Mentor kind of guidance.
7 Conclusions
Openly licensed machine-readable data could be a source of social and economic value [1, 29]. The Open Data movement is strong in the public sector domain, and the release of data held by public sector bodies for re-use is sometimes even encouraged by legislative means such as the European PSI Directive [6].
Methods that provide publishers with recommendations on how to overcome the problems commonly faced when publishing Open Data are emerging. The use of meta-models could help Open Data practitioners when assessing, selecting and customizing Open Data publication methods, because the concepts that form the building blocks of these methods are more likely to be explicitly defined.
The Software and Systems Process Engineering Meta-Model [18] is a common meta-model for representing development methods and processes, intended to make their development, maintenance and interchange easier. In this paper we analyzed the structure of three ODP methods by identifying the SPEM 2.0 meta-model concepts considered suitable for representing the content of the analyzed methods.
This paper presents the results of ongoing research. In future research we will further assess the suitability of SPEM 2.0 as a meta-model for the engineering of ODP methods. Zuiderwijk et al. [31] point out that multiple versions of the processes for publication of Open Data might be required for different types of data. Therefore we will also focus on the extension and variability mechanism offered by SPEM and its potential application for building bodies of information about the publication of Open Data that could be shared and customized to fit the needs of specific organizations and the types of data they manage and publish.
Acknowledgements. This paper was processed with a contribution of the long-term institutional support of research activities by the Faculty of Informatics and Statistics, University of Economics, Prague.
References
1. Capgemini Consulting: Creating Value through Open Data: Study on the Impact of Re-use
of Public Data Resources (2015). http://www.europeandataportal.eu/sites/default/files/edp_
2. Bendraou, R., Jezequel, J.-M., Gervais, M.-P., Blanc, X.: A comparison of six UML-based
languages for software process modeling. IEEE Trans. Softw. Eng. 36(5), 662–675 (2010)
3. Brinkkemper, S.: Method engineering: engineering of information systems development
methods and tools. Inf. Softw. Technol. 38(4), 275–280 (1996)
4. Brusa, G., Caliusco, M.L., Chiotti, O.: Towards ontological engineering: a process for
building a domain ontology from scratch in public administration. Expert Syst. 25(5), 484–
503 (2008)
5. European Union: DCAT application profile implementation guidelines (2016). https://
6. European Union: Directive 2013/37/EU of the European Parliament and of the Council of 26
June 2013 amending Directive 2003/98/EC on the re-use of public sector information
(2013). http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32013L0037
7. European Union: Directive 2003/98/EC of the European Parliament and of the Council of 17
November 2003 on the re-use of public sector information (2003). http://eur-lex.europa.eu/
8. Gonzalez-Perez, C., McBride, T., Henderson-Sellers, B.: A metamodel for assessable
software development methodologies. Softw. Qual. J. 13(2), 195–214 (2005)
9. Government of Flanders in Belgium: Open Data Handbook (Open Data Handleiding)
(2016). https://overheid.vlaanderen.be/sites/default/files/Open_data_handboek_2016.pdf
10. Henderson-Sellers, B.: Standards harmonization: theory and practice. Softw. Syst. Model. 11
(2), 153–161 (2012)
11. Janssen, M., Charalabidis, Y., Zuiderwijk, A.: Benefits, adoption barriers and myths of open
data and open government. Inf. Syst. Manag. 29(4), 258–268 (2012)
12. Loucopoulos, P., Kadir, W.M.N.W.: BROOD: business rules-driven object oriented design.
J. Database Manag. 19(1), 41–73 (2008)
13. Martínez-Ruiz, T., García, F., Piattini, M., Münch, J.: Modelling software process variability: an empirical study. IET Softw. 5(2), 172–187 (2011)
14. Ministry of the Interior of the Czech Republic: Standards for publication and cataloguing of
Open Data of the public sector in the Czech Republic (Standardy publikace a katalogizace
otevřených dat VS ČR) (2016). http://opendata.gov.cz/
15. Moraitis, P., Spanoudakis, N.: The Gaia2JADE process for multi-agent systems development. Appl. Artif. Intell. 20(2–4), 251–273 (2006)
16. Nečaský, M., Chlapek, D., Klímek, J., Kučera, J., Maurino, A., Rula, A., Konecny, M., Vanova, L.: Deliverable D5.1: methodology for publishing datasets as open data (2014).
17. Nečaský, M., Chlapek, D., Klímek, J., Kučera, J., Maurino, A., Rula, A., Konecny, M., Vanova, L.: Deliverable D5.1: methodology for publishing datasets as open data. Methodology Master Spreadsheet (2014). http://www.comsode.eu/wp-content/uploads/
18. Object Management Group: Software and Systems Process Engineering Meta-Model
Specification, version 2.0 (2008). http://www.omg.org/spec/SPEM/2.0/PDF
19. Open Knowledge: How to Open Up Data? http://opendatahandbook.org/guide/en/how-to-open-up-data/
20. Open Knowledge: Open Data Handbook, Solutions Bank. http://opendatahandbook.org/
21. Open Knowledge: Open Definition 2.1. http://opendefinition.org/od/2.1/en/
22. Open Knowledge: What is Open Data? http://opendatahandbook.org/guide/en/what-is-open-data/
23. Project Open Data. https://project-open-data.cio.gov/
24. Rodríguez-Elias, O.M., Martínez-García, A.I., Vizcaíno, A., Favela, J., Piattini, M.:
Modeling and analysis of knowledge flows in software processes through the extension of
the software process engineering metamodel. Int. J. Softw. Eng. Knowl. Eng. 19(2), 185–
211 (2009)
25. Saldaña-Ramos, J., Sanz-Esteban, A., García-Guzmán, J., Amescua, A.: Design of a
competence model for testing teams. IET Softw. 6(5), 405–415 (2012)
26. Share-PSI 2.0: Best Practices for Sharing Public Sector Information. https://www.w3.org/
27. Share-PSI 2.0: Guides to Implementation of the (Revised) PSI Directive (2016). https://
28. Socrata: Open Data Field Guide (2016). https://socrata.com/open-data-field-guide/
29. Ubaldi, B.: Open Government Data: towards empirical analysis of open government data initiatives. In: OECD Working Papers on Public Governance, vol. 22. OECD Publishing (2013)
30. World Wide Web Foundation: Open Data Barometer: ODB Global Report Third Edition
(2015). http://opendatabarometer.org/doc/3rdEdition/ODB-3rdEdition-GlobalReport.pdf
31. Zuiderwijk, A., Janssen, M., Choenni, S., Meijer, R.: Design principles for improving the
process of publishing open data. Transforming Gov.: People Process Policy 8(2), 185–204
OGDL4M Ontology: Analysis of EU Member
States National PSI Law
Martynas Mockus(&)
CIRSFID, University of Bologna, Via Galliera 3, Bologna 40121, Italy
[email protected]
Abstract. Developers of Open Government Data mash-ups face the following legal barriers: different licenses, legal notices, terms of use and legal rules from different jurisdictions that are applied to open datasets. This paper analyzes the implementation of the Revised PSI Directive in the EU Member States and highlights the legal problems. Moreover, it analyzes how Public Sector Information is defined by national law and what requirements are applied to the datasets released by public sector institutions.
The results of the paper show that PSI regulation differs greatly across the EU Member States and that the implementation of the Revised PSI Directive has not been successful. These problems limit the re-use of Open Government Datasets.
The paper proposes an ontology for understanding the requirements that originate from the national law of the EU Member States and that are applied to Open Government Datasets. The ontology also models the different implementations of the EU PSI Directive in the Member States.
Keywords: Open data mashup · Licensing of open data · Ontology
1 Problem and Motivation
Open data and open government data definitions and principles were presented in our previous work [1]. This paper focuses on how technology could be used to deal with the differing regulation of an important subject: open government data.
In general, data is fuel for Enterprise Information Systems. According to the report [1], the EU economy could potentially grow by 1.9 % of GDP by 2020 as a result of re-using big and open data. In an ideal world the idea of Linked Open Data [2] could be realized easily, but the law and the regulation of data make this idea hard to accomplish in real life. Governments, municipalities and other public bodies are releasing Public Sector Information (PSI) under different legal and technical conditions, which are unstable and create artificial barriers to benefiting from the re-use of information. Arguably, the most valuable results of using open government data are obtained when the data is merged, connected, combined, mixed or enriched and analyzed in other ways. However, legal problems exist that do not allow this to be done smoothly and the expected economic benefits to be reached.
Open data licenses (or other regulation such as legal notices or terms of use) are not unified. This problem requires a deep analysis of open data licenses by every
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 59–73, 2016.
DOI: 10.1007/978-3-319-49944-4_5
developer before starting to connect different datasets in a mash-up model. The results of the Survey of the Licensing of Open Government Data [3] revealed a critical situation concerning the regulation (licensing) regimes: national open government data portals contain datasets protected by different licensing regimes, ranging from 33 (Spain) and 16 (Germany, Italy) down to 1–2 (Austria, the EC, Moldova, Portugal, the UK).
Different licensing terms mean, first of all, that it is not clear whether the datasets can be merged or used for commercial purposes, whether any limitations apply to the protection of the mash-up work, and whether different adapters' licenses can be used. The Survey [3] identified that OGD portals contain datasets which identify wrong licensing regimes, or which do not identify any licensing regime at all (it is not clear whether the link to the regulation is missing or no regulation applies), or for which the rules that come from national PSI law are not reproduced. This situation creates a risk that a government (the owner of the OGD) could start legal procedures against developers using OGD for violation of national PSI rules, even when the notification about the licensing regime was provided incorrectly by the government itself.
So how could the developers of Enterprise Information Systems that use OGD avoid investing in a legal analysis of OGD regulation, and reduce the risks arising from possible misinterpretation of national law in a global environment? One possible solution is to force governments to withdraw all regulation of OGD; an alternative is to have a tool which provides legal analysis of OGD automatically, or at least semi-automatically.
We believe that it is possible to create such a tool. We decided to deal with the legal problems coming from the EU Member States in the following way: (1) we identified general problems existing in the EU PSI domain (different regulation objects in national law; the PSI Directive and the Revised PSI Directive are not fully implemented); (2) we determined what kind of specific legal requirements are applied to open government datasets by national PSI law; and (3) we tried to model those requirements in an ontology, aiming to create a useful tool for understanding the complexity of OGD regulation at the EU level.
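To illustrate the intended direction of step (3), legal requirements attached to a dataset could be modelled as subject-predicate-object statements and queried by a mash-up developer. The class and property names below (e.g. ogd:requiresAttribution) are invented placeholders, not the actual OGDL4M vocabulary:

```python
# A toy triple store: each statement is a (subject, predicate, object) tuple.
triples = set()


def add(s, p, o):
    triples.add((s, p, o))


# Hypothetical facts about one dataset and its licence.
add("ds:transport-timetables", "rdf:type", "ogd:OpenGovernmentDataset")
add("ds:transport-timetables", "ogd:jurisdiction", "AT")
add("ds:transport-timetables", "ogd:licensedUnder", "lic:CC-BY-4.0")
add("lic:CC-BY-4.0", "ogd:requiresAttribution", "true")


def requirements_for(dataset):
    """Follow the dataset's licence(s) and collect the obligations imposed."""
    licences = {o for s, p, o in triples
                if s == dataset and p == "ogd:licensedUnder"}
    return {(p, o) for s, p, o in triples if s in licences}


print(requirements_for("ds:transport-timetables"))
```

A mash-up developer could compute such obligation sets for each dataset to be merged and compare them before combining the data; a real implementation would use an RDF/OWL toolchain rather than this hand-rolled store.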
This paper is organized as follows: (1) introduction to the problem and motivation; (2) analysis of the implementation of the Revised PSI Directive; (3) analysis of EU Member States' national PSI law; (4) an ontology for the legal requirements of OGD; (5) conclusions and future work.
2 Open Government Data: Legal Problems Coming from EU
in Re-Use of PSI Domain
In the European Union, the philosophy of re-use of public information and the main legal requirements applied to Open Government Data come from the PSI Directive. If the concept of the PSI Directive [2] (including the Revised PSI Directive [3]) worked as planned, legal problems concerning the re-use of open datasets would not exist. Unfortunately, the reality is different. The EU Commission still has a lot of work to do in order to change the existing opinion that the information held by a public institution is the property of the state and “no one can touch it”.
Our investigation found that the development of the EU Commission-supported PSI concept can be grouped into three periods:
(1) the period before the PSI Directive was adopted;
(2) the period of implementation of the PSI Directive (approx. 2003/2005–2013/2015);
(3) the period of the revision of the PSI Directive in 2013 and its implementation.
Before the PSI Directive was adopted, the concept of PSI developed in a decentralized manner in the EU member and pre-member countries. Every single country had its own independent concept, which created a “Tower of Babel” effect. In 2003 the PSI Directive was published, and it should have been implemented by 2005. The PSI Directive sets a minimum harmonisation of national rules and practices concerning the PSI concept and its re-use. The implementation of the PSI Directive was not sufficiently successful in the Community, and a revision of the PSI Directive was made after 10 years. The Revised PSI Directive gives the EU Commission tools to control the implementation of the PSI Directive, and hopefully in the coming years a united concept of PSI in the EU can be found, if the EU Commission uses those tools effectively.
Implementation of Revised PSI Directive
The survey investigated the national PSI laws of the Member States published in the Portal of the European Commission [4].
Some explanations of Table 1: (1) in Spain, different charges for commercial re-use may apply, while the Revised PSI Directive does not allow such an option; (2) in Latvia, re-use is allowed only for private individuals; (3) in Denmark, charging principles are not applied; (4) in Hungary, different terms of exclusive arrangements are provided from 1 January 2016 instead of 17 July 2013, and Hungary excludes libraries, museums, archives and university libraries from the duty to provide information for re-use, etc.; (5) Finland has not implemented the PSI Directive because it had already implemented its own unique concept: PSI belongs to the public domain.
Table 1. Implementation of the Revised PSI Directive

Implemented fully: Austria, Italy and Malta.
All main terms implemented, only minor regulation not harmonized: Germany, the Netherlands and the United Kingdom.
Main terms implemented, but some important ones missing: Greece, Spain, Sweden.
Different, even contrary, terms implemented: Denmark, Hungary and Latvia.
Revised PSI Directive not implemented by the deadline (18 July 2015): Belgium, Bulgaria, Croatia, the Czech Republic, Cyprus, Estonia, France, Ireland, Lithuania, Luxembourg, Poland, Portugal, Romania, Slovakia, Slovenia.
Analysis of National PSI Law
Having found that the implementation of the Revised PSI Directive was not successful, we continued with an analysis of national PSI law to get a clear view of the legal framework and to discover the differences that follow from the OGD regulation.
We asked two questions to start the legal analysis of national PSI laws in the EU Member States: (1) Is the investigated object, public sector information, understood in the same way as it is defined in the EU PSI Directive, and if not, how does it differ? (2) What legal requirements are applied to OGD licensing?
Analysis of the PSI Term Used in the Legal Domain of EU Member Countries. Analysis of the legal domain in the EU and its member countries indicates that the main problem is that the term “public sector information” is understood differently in the EU member countries, while EU legislation is trying to gather the different concepts into one united concept of PSI.
In a wider approach, the PSI concept can be found not only decentralized or united, but also direct or expanded. The direct concept covers the idea that comes directly from the term “public sector information” and includes the different forms of information managed by the public sector. The expanded concept complements the direct concept with extra rules, exceptions and tasks.
A good example of a direct PSI definition is published by the Organisation for Economic Co-operation and Development (OECD): public sector information is “information, including information products and services, generated, created, collected, processed, preserved, maintained, disseminated, or funded by or for the Government or public institution” [5]. The OECD PSI definition is clear enough and describes PSI basically as all the information held by a public institution.
The EU PSI Directive represents the expanded form of the PSI concept and presents a somewhat different concept of PSI (compared to the OECD), because the PSI concept has developed from “the right to get access to public information” and can basically be described shortly as information accessible to the public, re-usable by the public, and held by a public institution. Over 10 years this concept has changed slightly from “can be re-usable” (in the PSI Directive, 2003) to “must be re-usable” (in the Revised PSI Directive, 2013).
The term “information” has acquired an expansive meaning nowadays and is usually used as a synonym for data, records, documents, etc. Erik Borglund and Tove Engvall investigated how the open data discourse is communicated in legal texts and found that there is no single term; the principal words are record, information, document and data [6].
It is no surprise that these terminology problems reach the European Union, and especially its Member States' legislation, where the definition of public sector information (PSI) is understood differently.
In Directive 2003/98/EC (the PSI Directive), PSI is understood as a “document”; during the revision of the directive the definition was not changed, but the concept was expanded in Directive 2013/37/EU (the Revised PSI Directive). The implementation of the PSI Directive and the Revised PSI Directive in the EU Member States is still developing, so the PSI definition is not yet harmonized in the EU Member States' national law.
OGDL4M Ontology: Analysis of EU Member States National PSI Law
The definition of a document is provided by Article 2(1)(3) of the Directive: "'Document' means: (a) any content whatever its medium (written on paper or stored in electronic form or as a sound, visual or audiovisual recording); (b) any part of such content" [2]. Basically, then, public sector information is understood as a document or part of a document, regardless of form or content. In the preamble of the Directive the term "document" is used as a synonym for information and also includes data.
In legal interpretation the term "document" is more closely tied to the legal responsibility of the institution or information holder than terms such as "information" or "data". Moreover, the concept of "access to documents" derives from the "right to get information from the public sector" and was understood as the right to obtain specific documents.
Secondly, after ten years the PSI Directive was revised with the intention of further harmonizing the PSI definition across the Member States. The legislators of Directive 2013/37/EU (the Revised PSI Directive) noted: "since the first set of rules on re-use of public sector
information was adopted in 2003, the amount of data in the world, including public
data, has increased exponentially and new types of data are being generated and
collected (recital 5).” [3] “At the same time, Member States have now established
re-use policies under Directive 2003/98/EC and some of them have been adopting
ambitious open data approaches to make re-use of accessible public data easier for
citizens and companies beyond the minimum level set by that Directive. To prevent
different rules in different Member States acting as a barrier to the cross-border offer of
products and services, and to enable comparable public data sets to be re-usable for
pan-European applications based on them, a minimum harmonization is required to
determine what public data are available for re-use in the internal information market,
consistent with the relevant access regime. (recital 6)" [3]. On the one hand, the legislators expressed in the preamble of the Revised PSI Directive their intention to harmonize "public data" (as it affects the internal European information market); on the other hand, no important changes were made to the definition in Article 2 of the PSI Directive, only the concept of PSI was updated.
Thirdly, the PSI Directive 2003/98/EC has been implemented in all EU Member States and the EEA countries (Iceland, Liechtenstein and Norway). The problem is that "EU Member States have implemented the PSI Directive in different ways. 13 Member
States have adopted specific PSI re-use measures: Belgium, Cyprus, Germany, Greece,
Hungary, Ireland, Italy, Luxembourg, Malta, Romania, Spain, Sweden, United Kingdom. 3 Member States have used the combination of new measures specifically
addressing re-use and legislation predating the Directive: Austria, Denmark and
Slovenia. 9 Member States have adapted their legislative framework for access to documents to include re-use of PSI: Bulgaria, Croatia, Czech Republic, Estonia, Finland,
France, Latvia, Lithuania, Netherlands, Poland, Portugal, Slovak Republic” [4].
A deeper investigation of the national law of the EU Member States shows existing differences in the PSI definition: some countries define PSI as a "document", others as "information", "data" or something else.
These differences can be classified into countries that use (1) the same definition of PSI as provided in the PSI Directive (Austria (including the Vienna, Vorarlberg, Lower Austria, Tyrol, Styria, Salzburg and Upper Austria Länder), Cyprus, the Slovak Republic (from 2012), Greece (from 2006 until 2014), Luxembourg and Spain) and (2) those that have adopted a specific definition (all others).
M. Mockus
The definitions can also be classified into four groups: a document group (the definition of PSI is strongly related to a document), an information group (PSI is understood as some kind of information), a data group (PSI is understood as data, a record, a file, etc.) and an other group (PSI is understood as a representation of content, knowledge, matters, etc.).
The document group can be subdivided into smaller parts: (1) Document: Austria (including the Vienna, Vorarlberg, Lower Austria, Tyrol, Styria, Salzburg and Upper Austria Länder), Cyprus, the Slovak Republic (from 2012), Greece (from 2006 until 2014), Luxembourg and Spain use the same definition as provided in the PSI Directive. (2) Documented information: Estonia defines PSI as information which is recorded and documented, meaning that information which is not documented falls outside the scope of PSI; Latvia defines it as "documented information – information whose entry into circulation can be identified". (3) Administrative documents: France and Portugal define PSI as "administrative documents". (4) Documents, information and data: Greece (from 2014) implements the Revised PSI Directive and provides an updated conception of PSI: the documents, information and data which are made available online as a dataset or via programming interfaces in an open, machine-readable format which complies with open standards. (5) Documents, records and data: Ireland defines PSI as a document, meaning all or part of any form of document, record or data. (6) Document and any content: Romania defines PSI as a document, meaning any content or part of such content.
The information group can be subdivided into: (1) Information and metadata: the Czech Republic defines PSI as "publicly disclosed information" and also includes metadata, which it names "accompanying information". (2) Any information: Bulgaria defines PSI as any information collected or created by a public sector body. (3) Public information: PSI is defined as public information in the Netherlands and in Poland (all information about public matters constitutes public information), while the Slovak Republic (until 2012) used a very narrow definition limited to information about public money, state/municipal property and concluded agreements. (4) Information in the form of a document, case, register, record or other documentary material: Slovenia defines PSI as information originating from the field of work of the body and occurring in the form of a document, a case, a dossier, a register, a record or other documentary material drawn up by the body, by the body in cooperation with another body, or acquired from other persons. (5) Information means content: the UK (from 2015) defines PSI as information, meaning any content or part of such content.
The data group can be subdivided into these parts: (1) Data: Croatia defines PSI as any data owned by a public authority, so ownership of the rights to the data is important; Hungary (2005–2015) defined it as data of public interest and data made public on grounds of public interest. (2) Data collections: Denmark (from 2005) granted access not only to documents but also to data collections; an exception was made for information produced for the commercial activities of a public sector body, or for which third parties hold a non-material right. "Data collection" means registers or other systematic lists for which use is made of electronic data processing. (3) Files: Denmark (until 1985) granted access to files only if (a) they were the substance of the authority's final decision on the outcome of a case; (b) the documents contained only information that the authority had a duty to record; (c) the documents were self-contained instruments drawn up by an authority to provide proof or clarity concerning the actual facts of a case; or (d) the documents contained general guidelines for the consideration of certain types of cases. (4) Any record: Germany defines PSI as any record stored in any way.
The other group consists of these parts: (1) Presentation and message: Finland defines PSI as a "written or visual presentation, and also as a message". (2) Presentation of acts, facts and information: Italy defines PSI as a document, meaning the presentation of acts, facts and information. (3) Any representation of content: the Vorarlberg Land (of Austria), until 2015, defined PSI as any representation of content, or part of it, for which the public sector body may decide whether to allow re-use. (4) Representation of acts, facts or information, and any compilation thereof: Malta, until 2015, defined PSI as a document, meaning any representation of acts, facts or information and any compilation of such acts, facts or information. (5) Knowledge: Lithuania defines it as follows: "document shall mean any information; information shall mean knowledge available to a State or local authority institution or body". (6) Known factual statements on matters: the Carinthia and Burgenland Länder (of Austria) define PSI as factual statements on matters which, at the time of the request for information, are known to the body. (7) Matter or recording, and compilations of information: Sweden defines PSI as a document, meaning any written or pictorial matter or recording which may be read, listened to, or otherwise comprehended only using technical aids; this also includes a compilation of information taken from material recorded for automatic data processing.
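For machine processing, the four-group classification above can be captured as a simple lookup table. The following Python sketch is illustrative only and covers a subset of the countries discussed; the country-to-group assignments follow the text, while the dictionary and function names are our own (they are not part of OGDL4M).

```python
# Illustrative lookup of the four-group classification of national PSI
# definitions described above; group assignments follow the text.
PSI_DEFINITION_GROUP = {
    "Estonia": "document", "Latvia": "document", "France": "document",
    "Portugal": "document", "Ireland": "document", "Romania": "document",
    "Czech Republic": "information", "Bulgaria": "information",
    "Netherlands": "information", "Poland": "information",
    "Slovenia": "information",
    "Croatia": "data", "Hungary": "data", "Denmark": "data", "Germany": "data",
    "Finland": "other", "Italy": "other", "Lithuania": "other",
    "Sweden": "other",
}

def definition_group(country):
    """Return the PSI-definition group of a country, or None if unlisted."""
    return PSI_DEFINITION_GROUP.get(country)

print(definition_group("Germany"))  # prints "data"
```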
The analysis of the definitions shows that most EU Member States use different terms to describe public sector information. From the open government data perspective it is not so important whether the term used is "document" or "data"; it is more important to see whether the definition sets extra limits which go beyond the scope of the PSI Directive.
Firstly, it is risky to limit the PSI definition to administrative documents or documented information, because public bodies hold plenty of information which does not qualify as administrative documents, or simply as "documents" or "documented information" in bureaucratic terms. E.g., live traffic data from a municipality's sensors and cameras do not fit the requirements of administrative documents.
Secondly, defining PSI through the ownership of the information (e.g. "belongs to a public sector institution") should also be avoided, because some works belong to the public domain and, according to the Revised PSI Directive, should be provided as public domain works (e.g. from archives and museums). There are also discussions [7] in the open data community as to whether PSI belongs to the public sector or to the public domain (because it was produced with public money).
Thirdly, it is a common mistake to define PSI as the information given for re-use, e.g. "Document held by a public sector body: a 'document' regarding which the public sector body is entitled to allow re-use" [8]. Limiting PSI to only the information which the institution provides for re-use should be avoided, because it restricts the right of access to information and the initiative to ask for new information which the institution does not yet provide. On the other hand, such a limitation is the right of each EU Member State according to recital 9 of the PSI Directive: "This Directive does not contain an obligation to
allow re-use of documents. The decision whether or not to authorise re-use will remain
with the Member States or the public sector body concerned. This Directive should
apply to documents that are made accessible for re-use when public sector bodies
license, sell, disseminate, exchange or give out information” [2].
Finally, the implementation of the Revised PSI Directive brings changes to PSI terminology, because the PSI concept was updated to include metadata, open and machine-readable formats, and the emerging understanding of what open data is. For example, the Spanish PSI regulation of 2015 states: "Document: All information or part thereof, whatever the medium or form of expression, whether textual, graphic, sound, visual or audiovisual, including associated metadata and data content with the highest levels of accuracy and disaggregation" [9].
There is hope that the implementation of the Revised PSI Directive will help the Member States to adopt definitions of PSI constructed to support the open data concept, as Greece has done [10].
Analysis of the Legal Requirements Applied to OGD Licensing in National PSI Law. In each country, all public sector data released as open government data (or, in other words, PSI ready for re-use) is regulated by national PSI law. Depending on the country, there may also be PSI laws of Länder (e.g. the Wiener Informationsweiterverwendungsgesetz (WIWG)), municipalities or public institutions, but those laws follow the federal or national PSI regulation. Our analysis is limited to the main national PSI regulation.
The analysis revealed differences in the legal requirements applied to OGD licensing among the EU Member States. In most cases these differences are not significant and follow the rules of the EU PSI Directive, but some are contrasting: e.g. in Spain a re-user of PSI can be fined up to EUR 100,000 for violating the re-use policy, while in Croatia a public authority which prevents or restricts the exercise of the right of access to information and re-use of information can be fined up to HRK 100,000 (approximately EUR 13,000).
In order to make those requirements understandable in a machine-readable format, a first version of the ontology has been developed.
3 The Ontology of Open Government Data Licenses
Framework for a Mashup Model (OGDL4M)
The Ontology of Open Government Data Licenses Framework for a Mashup Model (OGDL4M) is an OWL ontology formalizing the legal knowledge of the open government data licensing framework, in order to represent the legal requirements applied to open government datasets in a mash-up model. OGDL4M is still under development and we expect to present it by the end of 2016. This section describes the part of OGDL4M dedicated to representing the legal requirements for open government data licensing, the terms of use, and the sanctions for violations, which derive from the national re-use of public sector information (PSI) laws of the EU Member States.
State of the Art
At the moment there are no similar ontologies representing the national-level PSI domain of the EU Member States, but there are ontologies covering licensing (L4LOD [11], RDFLicense [12]), intellectual property (IPROnto [13], CopyrightOnto [14]), linked data rights (ODRL v.2.1 [15]), legal norms (LKIF [16]) and the rights expression language ccREL [17].
The main scholars working on subjects related to this ontology are M. Palmirani [18, 19], S. Peroni, P. Casanovas [20], V. Rodríguez-Doncel [21], S. Villata, F. Gandon, A. Kasten, D. Paehler, R. García, and J. Delgado.
Merged Ontologies
The OGDL4M ontology re-uses elements of other ontologies (Table 2):
Table 2. Merged ontology objects
– Attribution, CommercialExpl, NoCommercial, NoDerivative, ShareAlike
– Exception, LegalPerson, LegalSource, Legal_Document, Natural_Person, Obligation, Permission, Prohibition, Right
– Action, CreativeWork
– AttributionRight, DisseminationRight, EducationRight, InformationRight, IntegrityRight, MoralRight, OfficialActRight, ParodyRight, PrivateCopyRight, QuotationRight, TemporaryReproductionRight, UserRights, Withdraw, WithdrawalRight
– Time (ti)
The objective of this part of the ontology is to help create a theoretical model which can inspire an automatic or semi-automatic computational model able to represent the national-law PSI rules of the EU Member States, especially when the licensing regime is not clear or when the conditions for re-use are not provided.
Formation of a List of All the Relevant Terminology and Production of a Glossary
We have developed a table in which we indicate the terms and provide the legal description, the legal source and a normalized definition (Table 3).
Table 3. Example of the glossary
Definition by legal source: "In respect of the expression of the database which is protectable by copyright, the author of a database shall have the exclusive right to carry out or to authorize: translation, adaptation, arrangement and any other alteration" (link to legal source: Directive 96/9/EC, Art. 5.1(b)); "Authors of literary or artistic works shall enjoy the exclusive right of authorizing adaptations, arrangements and other alterations of their works" (link to legal source: Berne Convention, §12). Normalized definition: the act or process of modifying the content of the …
OGDL4M consists of a core part, which presents the general concept, and further parts based on each country profile.
In Fig. 1 a fragment of the core part of OGDL4M is presented. The class LKIF:LegalSource is indicated as the source of all possible regulatory sources which could apply to a dataset released by the public sector. E.g., if an information system wants to evaluate which legal requirements (class ConditionsOfPSIReuse) apply to a dataset (class OpenGovDatasets), it must investigate all possible legal sources (class LKIF:LegalSource).
Fig. 1. The fragment of the OGDL4M core part: legal source.
The classes LegalNotice, TermsOfUse and License represent the forms of regulation which are commonly used to express the connection between a dataset and its legal regulation. Usually, by mistake, these forms are applied without taking care of another important class, LKIF:Legal_Document, which represents regulation coming from different legal areas: personal data protection, copyright law, the EU database sui generis right, and PSI law, which is divided into the country level (national PSI law) and the level of Länder, municipalities and institutions (localized PSI law).
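The evaluation logic of the core part can be illustrated in plain Python. This is a sketch of the idea, not of the OWL axioms: the class names (ConditionsOfPSIReuse, the licence forms, the legal areas of LKIF:Legal_Document) are taken from the text above, while the function and dictionary keys are our own assumptions. The point is that the conditions of re-use of a dataset are collected from every applicable legal source, not only from the licence or legal notice attached to the dataset.

```python
# Sketch: ConditionsOfPSIReuse for an open government dataset must be
# gathered from every applicable legal source, not only from the
# LegalNotice/TermsOfUse/License attached to the dataset.
# The legal areas mirror the LKIF:Legal_Document subdivision in the text.
LEGAL_AREAS = [
    "PersonalDataProtection",
    "CopyrightLaw",
    "DatabaseSuiGenerisRight",
    "NationalPSILaw",
    "LocalizedPSILaw",
]

def conditions_of_psi_reuse(dataset):
    """Gather all regulatory sources that may impose re-use conditions."""
    sources = list(dataset.get("attached_regulation", []))
    sources += [area for area in LEGAL_AREAS
                if area in dataset.get("applicable_law", [])]
    return sorted(set(sources))

dataset = {
    "attached_regulation": ["License:CC-BY-4.0"],
    "applicable_law": ["NationalPSILaw", "PersonalDataProtection"],
}
print(conditions_of_psi_reuse(dataset))
```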
In Fig. 2 another fragment of the core part of OGDL4M is presented, which explains how different national PSI regulations can be modeled. National PSI regulation provides rules which state whether the PSI re-use requirements are obligatory or only recommended, or whether those requirements (some, all or none) are not regulated by national law but must or may be regulated by local PSI law.
Fig. 2. The fragment of OGDL4M core part: general requirements.
The class NationalPSILaw represents national PSI law, which is legally binding and sets the country's general legal rules applied to the conditions of PSI re-use. The class GeneralRequirements is a subclass of NationalPSILaw and represents those general rules. The rules can be obligatory (class ObligatoryGR) or only recommended (class RecommendedGR). When the rules are obligatory, any contradicting legal rules set on a dataset are not valid. E.g., in Finland OGD may be released only as part of the public domain, so no other rules can apply to OGD released by a public institution in Finland, in particular no licence which does not represent the public domain (such as CC-BY); and if the licence is missing, it is clear that the dataset is part of the public domain.
In other cases, when the national PSI regulation only recommends following certain rules, the PSI policy is usually delegated to a lower authority. The class SpecialRequirements is used to present the link to the local PSI law (of a Land, municipality, institution or other public authority) and the limitation of the possible use (without deeper analysis) of the ontology for the current country profile.
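The obligatory/recommended distinction can be sketched as a small resolution rule in Python. This is an illustrative sketch under our own naming assumptions; only the Finland rule itself (OGD belongs to the public domain, obligatorily) comes from the text.

```python
# Sketch: an obligatory national rule (ObligatoryGR) overrides any
# contradicting rule attached to the dataset; a merely recommended rule
# (RecommendedGR) acts only as a fallback when no licence is attached.
def effective_regime(national_rule, obligatory, dataset_licence):
    if obligatory:
        return national_rule  # contradicting dataset-level rules are not valid
    return dataset_licence or national_rule

# Finland: public domain is obligatory, so an attached CC-BY licence is
# displaced, and a missing licence also resolves to the public domain.
print(effective_regime("PublicDomain", True, "CC-BY"))  # prints "PublicDomain"
print(effective_regime("PublicDomain", True, None))     # prints "PublicDomain"
```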
OGDL4M Model for the Country Profile
The legal requirements applied to OGD licensing in national PSI law are modeled by identifying which requirements are obligatory and which are only recommended. Requirements are presented by identifying the legal source of the requirement (the concrete part of the law); this is necessary for quick cross-checking and for evaluating whether the norm is still valid. If there are sanctions for violation of the PSI re-use policy, the class SanctioningRegime is used. In the country profile, the ISO 3166 code is attached to the PSILaw, Jurisdiction and GeneralRequirements classes.
Fig. 3. The fragment of OGDL4M representing Finland’s legal requirements to OGD.
In Fig. 3 the OGDL4M model for Finland is presented. The class PSILawFI represents Finland's legally binding PSI law, the Act on the Openness of Government Activities with its amendments [22]. The model explains that the general requirements (class GeneralRequirementsFI) are set by Chapter 1, Section 1(1) of the Act on the Openness of Government Activities and are applied obligatorily. Only one legal requirement applies to OGD: PSI belongs to the public domain.
In Fig. 4 the OGDL4M model for Spain is presented. The class PSILawES represents Spain's legally binding PSI law, the Law on the re-use of public sector information with its amendments [9]. The general requirements (class GeneralRequirementsES) are obligatory to apply. The model explains that (1) OGD can be released with no conditions/licence (class NoConditionsForReuse), or (2) OGD can be regulated only by a standard licence. A standard licence must satisfy a set of conditions: the licence should be open, must not limit competition, must not restrict re-use, etc. The model thus allows only two licensing regimes in Spain, but in reality we found 33 during the survey. Licensing regimes which do not follow the regulation of Spain's PSI law are not correctly applied.
Fig. 4. The fragment of OGDL4M representing Spain’s legal requirements to OGD.
In Fig. 5 the specific conditions for re-use are presented. These conditions essentially implement conditions similar to a non-derivative licence (the content cannot be altered). This means that licensed OGD released by a public authority cannot be used in mash-ups in Spain. There is a conflict of legal norms, which require both that the re-use of PSI not be limited and that the PSI not be altered. The conditions which limit PSI re-use are supported by sanctions.
Fig. 5. The fragment of OGDL4M representing Spain’s legal requirements to OGD.
Fig. 6. The fragment of OGDL4M representing Spain’s legal requirements to OGD.
In Fig. 6 the sanctioning regime is explained. If OGD is released by Spain under a licence, these sanctions apply; e.g., failure to indicate the date of the latest update of the information will cost a developer from EUR 1,000 to EUR 10,000.
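Taken together, the Spanish country profile amounts to a small compliance check. The following Python sketch is illustrative only; the two regimes (NoConditionsForReuse, standard licence) and the standard-licence conditions come from the text above, while the dictionary property names are our assumptions.

```python
# Sketch of the two Spanish licensing regimes: either no conditions at all,
# or a standard licence that is open, does not limit competition and does
# not restrict re-use; anything else falls outside the modeled regimes.
def spain_licensing_regime(release):
    licence = release.get("licence")
    if licence is None:
        return "NoConditionsForReuse"
    compliant = (licence.get("open", False)
                 and not licence.get("limits_competition", False)
                 and not licence.get("restricts_reuse", False))
    return "StandardLicence" if compliant else "NonCompliant"

print(spain_licensing_regime({"licence": None}))
print(spain_licensing_regime({"licence": {"open": True}}))
print(spain_licensing_regime({"licence": {"open": True, "restricts_reuse": True}}))
```

A real evaluation would of course cite the concrete article of the law behind each condition, as the country profiles in OGDL4M do.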
4 Conclusions and Future Work
The legal analysis of the EU Member States' national PSI law has indicated the main problems: national law is not harmonized with the EU law, which is why the situation in most EU countries differs and requires deeper analysis of the national legal domain. The OGDL4M ontology can be a very useful tool for evaluating a country's PSI policy, and in the future it could be used for automatic or semi-automatic evaluation of the legal regulation of datasets released by the public bodies of the EU Member States.
Moving forward we expect to enrich the ontology and present the completed
version of OGDL4M by the end of 2016.
Acknowledgements. This research is funded by the ERASMUS MUNDUS program LAST-JD,
Law, Science and Technology coordinated by University of Bologna and supervised by Prof.
Monica Palmirani.
References
1. Mockus, M.: Open government data licenses framework for a mashup model (2014)
2. Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re-use of public sector information (OJ L 345, 31.12.2003, p. 90)
3. Directive 2013/37/EU of the European Parliament and of the Council of 26 June 2013
amending Directive 2003/98/EC on the re-use of public sector information
4. European Commission, Directorate-General for Communications Networks, Content and Technology: Implementation of the Public Sector Information Directive. https://ec.europa.eu/digital-agenda/en/implementation-public-sector-information-directive-member-states#how-has-each-eu-member-state-implemented-the-rules
5. Recommendation of the council for enhanced access and more effective use of public sector
information. OECD (2008)
6. Borglund, E., Engvall, T.: Open data? Data, information, document or record? Rec. Manag.
J. 24, 163–180 (2014)
7. License or public domain for public sector information? - Creative Commons. http://
8. Gesetz über die Auskunftspflicht, die Weiterverwendung von Informationen öffentlicher
Stellen sowie die Statistik des Landes Burgenland (Law on the re-use of public sector
information and the statistics of the land of Burgenland), LGBl. N 14/2007, 12/02/
9. Law No 18/2015, of 9 July 2015, amending Law No 37/2007, of 16 November 2007, on the
re-use of public sector information. King of Spain (2015)
10. Open provision and further re-use of documents, information and data of the public sector, amendment of Law 3448/2006 (A΄ 57), adaptation of the national legislation to the provisions of Directive 2013/37/EU of the European Parliament and of the Council … (Greek law, title translated from Greek)
11. Villata, S., Gandon, F.: L4LOD vocabulary specification. http://ns.inria.fr/l4lod/v2/l4lod_v2.
12. Rodríguez-Doncel, V., Villata, S.: RDFLicense. https://datahub.io/dataset/rdflicense
13. Delgado, J., Gallego, I., Llorente, S., García, R.: IPROnto: an ontology for digital rights
management. In: Legal Knowledge and Information System. Jurix, pp. 111–120 (2003)
14. Rhizomik: Copyright Ontology. http://rhizomik.net/html/ontologies/copyrightonto/
15. Iannella, R., Guth, S., Paehler, D., Kasten, A.: ODRL Version 2.1 Core Model. https://www.
16. Breuker, J., Hoekstra, R., Boer, A., van den Berg, K., Rubino, R., Sartor, G., Palmirani, M.,
Wyner, A., Bench-Capon, T., Di Bello, M.: LKIF-Core Ontology. http://www.estrellaproject.
17. Abelson, H., Adida, B., Linksvayer, M., Yergler, N.: ccREL: the creative commons rights
expression language (2008)
18. Palmirani, M., Governatori, G., Rotolo, A., Tabet, S., Boley, H., Paschke, A.: LegalRuleML:
XML-based rules and norms. In: Olken, F., Palmirani, M., Sottara, D. (eds.) RuleML 2011.
LNCS, vol. 7018, pp. 298–312. Springer, Heidelberg (2011). doi:10.1007/978-3-642-24908-2_30
19. Palmirani, M., Girardi, D.: Open government data: legal, economical and semantic web
aspects. In: Lawyers in the Media Society: The Legal Challenges of the Media Society.
Rovaniemi, University of Lapland Print. Centre, pp. 187–205 (2016)
20. Casanovas, P., Palmirani, M., Peroni, S., van Engers, T., Vitali, F.: Semantic web for the
legal domain: the next step. Semant. Web. 7, 1–15 (2016)
21. Rodríguez-Doncel, V., Santos, C., Casanovas, P., Gómez-Pérez, A.: Legal aspects of linked
data – The European framework. Comput. Law Secur. Rev. (2016, in press)
22. Act on Transparency in Government (1999, as amended)
Customer Relationship Management
Social Media and Social CRM
Antonín Pavlíček and Petr Doucek
University of Economics, Prague,
3 W. Churchill Sq., 130 67 Prague, Czech Republic
Abstract. The main aim of this paper is to analyse social CRM, specifically the Facebook communication of mobile operators in the United States, the Czech Republic and France, in order to examine the state of social customer care on social networking sites and to consider the possibilities and the need for automation and improvement of EIS. The analysis is based on messages and answers posted on Facebook pages and on measuring the response time for over 1.3 million unique questions. It identifies trends, looks for certain repeating patterns or correlations and, as a result, offers a comprehensive report on the current use of social media as a channel for customer care among mobile operators. Based on the theoretical background, we also propose advice on how to maintain a healthy relationship with customers on social networks and add real value both for customers and the company.
Keywords: Customer relationship management · Customer care · Social media · Social networking sites · Social CRM · Enterprise information system · Facebook · Good practice · Telecommunications · Mobile operators · Response time · Czech Republic · France · USA
1 Introduction
There is no doubt that social networking sites (SNS) have recently become the new communication standard in the e-society [4]. Thanks to their versatility, SNS can be used to find a job [2], a romantic partner or the latest gossip [5], to play interactive games [12], or just to have a private [10] conversation with friends and acquaintances [9]. Facebook is the main on-line community communication channel [13], not only for teenagers and young college students but also for the vast majority of the working population in developed countries. Its growing popularity has resulted in more than 1.65 billion Facebook users [7]. Companies follow this trend by investing in the proper maintenance of their profiles on social networks, but quality social media content is just one part of success. The other part, maybe even more important, makes up the core of this paper: the interaction with customers.
Contacting a company over SNS is just one or two clicks away. It is fast, free and public, so anybody can see the response, verified by thousands of visitors, who can assess the answer and judge the company. That is the main reason why customer care on social networks is becoming crucial.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 77–87, 2016.
DOI: 10.1007/978-3-319-49944-4_6
2 CRM and Social Media
A systematic review of the state of customer relationship management (CRM) [11] defines CRM as "the overall process of building and maintaining profitable customer relationships by delivering superior customer value and satisfaction, with the goal of improving the business relationships with customers. Also, it is the strongest and the most efficient approach to maintaining and creating the relationships with customers." A company should care about its customers by answering their requests and solving their problems. CRM can also contribute business intelligence factors by observing the expectations of the customers. Both of the above-mentioned goals can be supported by social media.
The modern approach to CRM began in the 1990s with Sales Force Automation (SFA), which enabled companies to hone their sales processes and boost productivity. CRM continued to evolve, and by the end of the century the world's first mobile CRM solution was introduced, followed by the first-ever Software-as-a-Service (SaaS) CRM solution.
Since the spread of the internet at the beginning of the 21st century, customers have become more informed about prices, products, services and competitors, and as a consequence are more demanding. Another point was described by Babinet [1], who explained that nowadays the worth of a product comes more from the additional services it can provide than from its real technical capacities; he called this the "revolution in the worth process for the companies". Companies began to see CRM as a way to manage all business relationships via a single platform. However, at the time companies had difficulties calculating the return on investment [6] and adapting to these new demands: they often did not reap the benefit of a performant CRM software or, on the contrary, wanted to develop a lot of functionalities without a strategy and, as a result, ended up with too much data that they did not know how to use.
In recent years, successful companies have built a real CRM strategy: especially companies that grew very fast, such as Uber, Airbnb, Booking or Apple, usually offer amazing customer care. Indeed, the companies that began succeeding were working on both front-office and back-office systems, and their goal was to link these systems together into a system that can be used by customers and business partners. The CRM market is expected to be worth £36.4 billion by 2017, according to Gartner [3].
We can identify two main ways to deal with customer care: customer interactions/
contact software (CIS) and Customer Relationship Management (CRM) software. CIS is
more flow oriented whereas the CRM system really focusses on the customer.
A customer interaction software makes it possible to optimize the dispatch of the incoming demand
stream according to the resources available to deal with these requests. This is
typically what goes on in call centres, but some developers are trying to apply the
same procedures to chat, e-mail, and social networks. The main goal is to split the demands
in order to organize the waiting line in an optimal way, with more or less sophisticated
allocation rules. The operating principle is very simple: when a customer contacts the company, no
matter how, a ticket is created. The ticket is then handled automatically, essentially at
random, by any available employee.
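The ticket-dispatch mechanism described above can be sketched as a queue with a pluggable allocation rule. The class and rule names below are illustrative, not taken from any particular CIS product; they simply model “every contact becomes a ticket, an allocation rule hands it to an agent”.

```python
import random
from collections import deque

class TicketDispatcher:
    """Toy model of a CIS: every incoming contact becomes a ticket,
    which an allocation rule hands to one of the available agents."""

    def __init__(self, agents, rule="random"):
        self.agents = list(agents)
        self.rule = rule
        self.queue = deque()
        self._rr = 0  # round-robin cursor

    def receive(self, customer, channel, text):
        # A ticket is created no matter how the customer makes contact.
        self.queue.append({"customer": customer, "channel": channel, "text": text})

    def dispatch(self):
        """Assign the oldest waiting ticket to an agent."""
        ticket = self.queue.popleft()
        if self.rule == "round_robin":
            agent = self.agents[self._rr % len(self.agents)]
            self._rr += 1
        else:  # "random": any employee may pick up the ticket
            agent = random.choice(self.agents)
        return ticket, agent

d = TicketDispatcher(["alice", "bob"], rule="round_robin")
d.receive("c1", "email", "My bill is wrong")
d.receive("c2", "chat", "No signal at home")
t1, a1 = d.dispatch()
t2, a2 = d.dispatch()
print(a1, a2)  # round-robin: alice bob
```

More sophisticated allocation rules (priority by waiting time, skill-based routing) would replace the branch in `dispatch` without changing the surrounding flow.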
A customer relationship management software tries to put the customer at the
centre of the process: it creates a database where all the information is gathered,
including e-mails, phone calls, letters, and faxes. It especially suits companies that wish to
know their clients in order to offer a customized service and tone. Some CRM software can also allocate each request to the right customer adviser, according to the adviser's
language and competencies.
In this respect, we can point to two main advantages of CRM software compared
with customer contact software. First and foremost, CRM makes it possible to follow the evolution of the client. This means a lot to clients, who really appreciate being
recognized and not merely treated as numbers: the way of answering cannot be the
same for a client contacting the company for the first time as for the fifth time on the same
topic. Social media can enhance such an individual approach. Secondly, CRM software enhances the knowledge about clients: customers want answers adapted
to their requests, and also products and services adapted to them. Once gathered in the
CRM software, these data can also be used by the marketing department to analyse the desires of its
customers. Feedback and verbatim comments are precious sources for marketing to
identify new segments, and social media are ideal partners for this.
Our paper focuses on the new possibilities of social communication that
emerged with the rise of Facebook, Twitter, and other SNS, while trying to answer
whether and how an Enterprise Information System (EIS) can be involved.
Customer Care on Social Networks: Social CRM
Social Customer Relationship Management (SCRM) does not replace CRM but it
extends its application field by integrating the data from social networks and other
digital exchange places.
In March 2008, Comcast was one of the first companies to interact directly with
customers via Twitter. Social media marketing was beginning to grab the attention of
organizations as the use of Facebook spread rapidly through the world.
Moreover, companies began to see the power of social media marketing to attract
customers. In return, customers began to see how they could use the web to
express their thoughts about companies online: organizations therefore became aware
that customers do not bring them feedback directly but instead post their opinions
publicly online. In this respect, CRM vendors started to develop systems that would
address the issues created by social networks.
After 2010, CRM began to integrate all departments: companies understood that the concept of CRM is strategic, and companies of all sizes implement CRM
software. As a consequence, social CRM needs to be much more integrated into
customer management systems and approaches. Social CRM was estimated to comprise
8% of all global CRM spending in 2012, double the 2010 share, according
to Gartner [3].
A. Pavlíček and P. Doucek
3 Research Questions
In our research we have, by observing actual events on SNS, tested the limits of meaningful SCRM usage and whether and how an Enterprise Information System (EIS) can be a
useful support tool. The paper seeks to answer the following research questions:
1. Is Social CRM really used in practice as extensively as theoretical sources claim?
2. What mood/sentiment prevails on SNS? Is the medium predominantly negative?
How does SCRM deal with it?
3. Does Social CRM create problems due to the expected short response times?
Should the SCRM process be automated or computer-aided?
4 Research Methodology
We decided to analyze mobile network operators (MNOs), since this industry is
closely connected with modern technologies, the companies are quite big and wealthy, and their
product is commoditized, so marketing and CRM are of vital importance. Also,
from the data perspective, MNOs usually have large fan communities on
Facebook, and their customers often face plenty of (often repetitive) problems that
need to be solved. We analyzed the official Facebook pages of three leading1 MNOs from
the Czech Republic, France, and the United States (see Table 1) and collected public
users' questions and all the answers, with all available metadata, primarily the times of
questions and responses. These data can show us the current state of social customer care, its
development over time, the differences between countries and individual operators, and
finally the usage of Facebook as a place to solve customers' problems.
Altogether, we managed to legally download over 1.3 million unique questions with all of the consecutive reactions. Over the six-year period (2010 to 2015), we
measured the time it took customer care staff to answer a question, and even
to answer the follow-up questions, as well as the absolute and relative totals of questions
asked and answered and the difference between these two. As a factor representing the quality of the answers, we counted the questions that were successfully solved
by one reply and related this number to all answered questions. These data were related
to the time of day, month, and year and tested for repeating patterns or
interesting formations. Each factor is also compared between individual operators and
aggregated by country.
For the qualitative analysis, we analyzed the 100 most popular posts (by likes and shares)
for each company (N = 900).
1 With one exception in the USA, where the largest operator, Verizon, does not have its Facebook
page open for public posts.
Social Media and Social CRM
Table 1. Number of fans and followers on Facebook and Twitter as of November 2015, in
thousands (K) and millions (M) (authors)

      | Czech Republic             | France                     | USA
      | O2     Vodafone  T-Mobile  | SFR     Orange   Bouygues  | AT&T   T-Mobile  Sprint
FCB   | 185 K  183 K     197 K     | 927 K   9.4 M    826 K     | 5.8 M  5 M       2.1 M
TW    | 8.5 K  39 K      5.5 K     | 52.2 K  123 K    82 K      | 711 K  575 K     345 K
Data Gathering
Facebook provides two different ways to gather data from its social network: the
query language FQL and the newer Graph API. Both of these meet our needs.
There are many online tools for Facebook data analysis; in our case, however, there
was no suitable solution on the market, as our requirements were rather specific. We
therefore used Power Query, an extension for Microsoft Excel with built-in support
for gathering data from Facebook.
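The Graph API returns a page's feed as JSON pages linked by a `paging.next` cursor, which the client follows until the feed is exhausted. The sketch below walks such pages offline: the sample payloads stand in for HTTP responses (a real client would issue GET requests against graph.facebook.com), while the field names (`data`, `created_time`, `paging.next`) follow the Graph API feed format.

```python
import json

# Two sample pages in the shape returned by GET /{page-id}/feed:
# each page carries a "data" list of posts and, if more results exist,
# a "paging" object with a "next" URL.
PAGES = {
    "page1": json.dumps({
        "data": [{"id": "p1", "created_time": "2013-05-01T10:00:00+0000",
                  "message": "Why is my bill so high?"}],
        "paging": {"next": "page2"},
    }),
    "page2": json.dumps({
        "data": [{"id": "p2", "created_time": "2013-05-01T11:30:00+0000",
                  "message": "No coverage downtown."}],
    }),
}

def fetch(url):
    # Stand-in for an HTTP GET against graph.facebook.com.
    return json.loads(PAGES[url])

def collect_posts(start_url):
    """Follow paging.next links until the feed is exhausted."""
    posts, url = [], start_url
    while url:
        page = fetch(url)
        posts.extend(page.get("data", []))
        url = page.get("paging", {}).get("next")
    return posts

posts = collect_posts("page1")
print(len(posts))  # 2
```

The same loop structure also makes the cut-off problem mentioned below visible: if a request fails partway through, the posts collected so far remain usable but the feed is left with a gap.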
Some operators do not make their Facebook page available for posting messages and
instead use their own custom web application for customer care, which makes it impossible
for us to analyse the desired information.
Also, the amount of fetched data can be limited by Facebook's cut-off policy when
too many requests are carried out consecutively. Due to this problem, there were a few
minor gaps in the downloaded data, estimated at 5%–10% of the complete set.
The average response time could have been influenced by long-unanswered posts that were
later commented on by another user and only then answered by the MNO. The response time in
such cases would be extremely high, so we removed those cases from the data set.
The minimal response values of the second and third answers were sometimes as low as
one second, when a client's response is commented on by someone else at the same moment as
by the operator itself; such records were also removed so they do not skew the results.
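The cleaning rules just described (drop near-zero delays caused by simultaneous third-party comments, drop extreme delays caused by posts answered only after a much later bump) can be sketched as a single filter. The thresholds below are illustrative, not the exact cut-offs used in the study.

```python
from datetime import datetime, timedelta

def response_delays(pairs, min_delay=timedelta(seconds=5),
                    max_delay=timedelta(hours=200)):
    """pairs: (question_time, first_answer_time) tuples.
    Returns delays with implausible outliers removed."""
    delays = []
    for asked, answered in pairs:
        d = answered - asked
        # Drop near-instant replies (someone else commented at the same
        # moment as the operator) and extreme delays (posts answered only
        # after being bumped much later by another user).
        if min_delay <= d <= max_delay:
            delays.append(d)
    return delays

def average_delay(delays):
    return sum(delays, timedelta()) / len(delays)

t = datetime(2013, 5, 1, 12, 0)
pairs = [
    (t, t + timedelta(seconds=1)),    # removed: simultaneous comment
    (t, t + timedelta(hours=2)),      # kept
    (t, t + timedelta(hours=4)),      # kept
    (t, t + timedelta(hours=500)),    # removed: bumped long after
]
kept = response_delays(pairs)
print(average_delay(kept))  # 3:00:00
```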
We were also disappointed by Twitter's policy, as we had originally planned to carry
out the same research there. Unfortunately, their API does not provide
full conversations with all the replies, which makes this kind of data mining impossible.
5 Results and Discussion
The SCRM level was measured by the extent and delay of the responses to customers' stimuli.
Table 2 shows the results for all MNOs; the overall response rate is just about
23% (320,682 out of 1,367,586). On average, the MNOs answered only every fourth
question raised on SNS. The highest answer rate belonged to Orange in France, with 49% of
questions answered. The worst response rate was observed for Sprint in the USA, with
90% of questions unanswered.
Table 2. Overall statistics (authors): for each MNO, the number of questions, the share of
unanswered questions, and the average, minimum, and maximum response times.
Incoming SCRM Communication
The total message chart (Fig. 1) shows the total incoming communication activity from users
(messages placed monthly on the MNOs' Facebook pages) over the six-year period. 20,000
monthly messages were surpassed in April, June, and October 2013 by Sprint, with approximately two-thirds of the cases being complaints and negative reactions. In comparison with
T-Mobile and AT&T, Sprint had the worst positive/negative ratio. In France, the
highest peak reached 13,297 messages in February 2012. The Czech market peaked in May
2013 with 4,153 questions/comments.
The Czech market is quite specific. Customer care implementation on Facebook started very
slowly for all MNOs: Vodafone was the first to use, promote, and track Facebook, in early 2011.
To this day we can trace their difficult beginnings on Facebook, reflecting the fact that IT innovations in telecommunications cause a lot
of unexpected situations; yet Vodafone CZ kept improving the quality of its
services and reduced the problems.
In January 2012, O2 CZ offered Facebook users the chance to win mobile phones for
free. That spurred the growth of active O2 customers, and from that moment O2 CZ
started to compete on Facebook with the other providers. T-Mobile CZ's and Vodafone CZ's
total messages softly declined, but O2 took the lead again in January 2014, this time by
harnessing the power of “influencers”, who started a big wave of interactions on the O2
Facebook page, with a lot of likes, shares, haters, and lovers, and boosted the total messages
by one-fifth every month.
Fig. 1. Total message count in time (authors)
By the end of 2015, no MNO received more than 3,000 messages per month. This
reduction of questions and answers can be explained by more “formal” behaviour
and by the removal of problems from the Facebook pages; still, there is enough data
comparing the questions and the helpfulness of the answers to identify T-Mobile as the dominant
SCRM player. Sprint, as the only provider on the USA market, could not handle the
situation and restricted Facebook users' rights to contribute to its page, practically acknowledging SCRM defeat.
The Frequency of SNS Posts During the Day
Figure 2 represents the total aggregated number of questions asked and answered during
each hour of the day. The highest number of questions was asked, as expected, on the USA
market, which is by far the largest one, between 11 a.m. and 9 p.m., peaking at 7 p.m.
(58,781). Questions in France peaked at 21,103 at 11 a.m., and in the Czech
Republic at 1 p.m. (12,686).
Fig. 2. Questions during a day
On the other hand, it is not surprising that 4 a.m. is the hour with the overall lowest
number of questions, common to all the markets. Questions are not limited to business
hours; we can witness customers asking questions almost around the clock.
Response Time
Figure 3 shows the average delay of the first reaction to a message during the day. The
x-axis reflects the time when the message was posted; the y-axis describes the average response
delay. Response time is a crucial SCRM parameter: it expresses the speed of reaction,
responsibility, and reliability in any situation. The worst overall reaction time
during the day belongs to SFR in France, with an average of 24 h (its best average reaction
still takes more than 9 h). There is a visible difference between the French approach to SCRM
and the other markets.
Whereas the average response time in the Czech Republic and the USA varies
between half an hour and three hours, the MNOs in France are literally off the scale,
beginning with SFR at about 6 h and ending with Bouygues Telecom and Orange at
over 12 h. The highest peak was hit at the beginning of 2015 on the SFR page, with
an average of over 84 h (more than three and a half days).
Generally, the poor performance of the French mobile operators seems to be caused by
cultural differences, as all the examined operators tend to perform similarly, evidently
with no effect on their number of customers.
Czech and USA MNOs react within the same day. The slowest average reaction has
been 9 h, and the fastest comments appear in less than 60 min. The worst results
occur at midnight, while the reaction is consistently fastest during the extended “office hours”
from 5 a.m. until 9 p.m. The fastest average answer time during the day occurs
around 3 p.m. and reaches 64 min. The slowest answers come from SFR in France:
messages placed at midnight wait 41 h and 45 min to be answered.
Secondly, we can see a stable trend of declining response times across all the
operators in our study. Over time, there are often clearly visible points when
a certain operator started to pay more attention to Facebook customer care. This trend
can be seen, for example, in the first half of 2013 on Sprint's page; AT&T did the
same at the beginning of 2014. The same happened on the Czech market, surprisingly
earlier than in the USA: Vodafone started in early 2012, and T-Mobile CZ during 2012
and early 2013, whereas O2 CZ had kept a decent response time since the beginning of our
data in 2011. The overall winner, however, would be T-Mobile USA, with consistently
prompt answers since 2012.
A generally positive fact is that operators tend to adapt to their customers' behaviour:
they keep answering questions until late evening or even during the night and do
not follow the typical 9-to-5 business schedule.
Fig. 3. Average response delay of the first answer (authors)
Sentiment Analysis
The second research question asked: “What mood/sentiment prevails on SNS? Is
the medium predominantly negative?” For each MNO, we analysed the top 100 posts
(900 in total) with the most shares and likes, with the results shown in Fig. 4. Our
findings confirm that Facebook is a predominantly negative medium.
In the Czech Republic, the majority (56%) of the most popular posts just identify
problem areas, highlight particular “bad experiences”, and confirm negative reactions
and statements offending the operator. The most frequent argument is the high price of
contract packages compared with Germany and Austria, and the gap between the quality
of service and its price. Second in frequency of likes are complaints focused on signal
coverage. Billing errors proved to be a popular topic too; one customer even paid for an
advertisement to promote his negative post.
In the USA, the discussion seems to be more extreme: users tend to either like or dislike
their MNO, and neutral posts are rare. Of the most shared and liked posts, only about 4% are
neutral, 57% are negative, and 39% are positive. Contrary to that, French users are neutral in
almost half of the cases (47%), while 45% are negative and just 8% are positive (Table 3).
Our finding is consistent with Galtung and Ruge's theory of news values,
in which negativity is identified as one of the most attractive factors of media
communication [8].
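The positive/neutral/negative tally used here could in principle be reproduced automatically with a lexicon-based scorer, a common baseline for sentiment analysis. The word lists below are toy examples introduced for illustration only; the study itself classified the posts manually.

```python
POSITIVE = {"great", "love", "thanks", "awesome", "best"}
NEGATIVE = {"worst", "overpriced", "error", "complaint", "hate"}

def classify(post):
    """Label a post by counting lexicon hits; ties fall back to neutral."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

posts = [
    "Worst coverage ever, I hate this operator",
    "Thanks for the quick fix, great support!",
    "New tariff announced today",
]
counts = {}
for p in posts:
    label = classify(p)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

As the paper notes later, such lexicons are language-specific, which is exactly why automating this step is hard for small markets.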
Enterprise Information System (EIS) and Social Media
Developers have been quick to take the initiative, adding various social media packages
to ERP systems. Let us have a look at how the integration of social media with ERP or EIS
should be constructed. If we integrate the social media
stream (Facebook, Twitter, LinkedIn, and others) as an additional input to the system, we
should then also consider how to analyse it automatically.
Fig. 4. Positive/negative sentiment of comments – in % (authors)
Table 3. TOP 100 posts evaluated on a positive/negative scale, with counts per country of
extremely positive, positive, neutral, negative, and extremely negative posts (authors)
The data analysis phases should include preprocessing, representing the social media
data in a convenient form, and the definition of some similarity metrics, and should result in the
clustering process. Text analysis platforms should be used, together with some relevant
language resources.
As a result, we can benefit from various trend detections and visualisations, in the
form of patterns, cluster results, data cubes, or graphs.
The matter is complicated even more by the fact that most of the above-mentioned
tools are language-related, so such an automated system is extremely difficult to localize for
small markets (simple translation does not work).
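The pipeline outlined above (preprocessing, representation, a similarity metric, clustering) can be sketched end to end. The greedy single-pass clustering below is a deliberate simplification chosen for brevity, not the method of any specific text analysis platform; it groups repetitive customer messages by bag-of-words cosine similarity.

```python
import math
import re

def preprocess(text):
    """Lowercase, strip punctuation, tokenize."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(tokens):
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(messages, threshold=0.4):
    """Greedy single-pass clustering: a message joins the first
    cluster whose seed is similar enough, otherwise starts a new one."""
    clusters = []  # list of (seed_vector, [messages])
    for msg in messages:
        vec = bag_of_words(preprocess(msg))
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(msg)
                break
        else:
            clusters.append((vec, [msg]))
    return [members for _, members in clusters]

msgs = [
    "my invoice is wrong this month",
    "wrong invoice again this month",
    "no signal in the city centre",
]
groups = cluster(msgs)
print(len(groups))  # 2
```

The language dependence shows up directly in `preprocess` and in any lexicon used downstream: every step would need market-specific resources, which is why simple translation of the tooling does not work.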
6 Conclusions
We have shown that Social CRM was used quite extensively in practice in 2012–2013,
with the number of user inputs declining lately. As we expected, the mobile operators did their
best to catch the trend of social networks, and over time they clearly invested in
improving the quality of customer care on Facebook (shortening the response time).
There is still an opportunity to gain customers through good social CRM, as none of
the observed companies performs ideally.
Our analysis also strongly suggests that most of the social interactions are still
handled “manually”, by customer care staff writing the posts. The timeline of
interactions shows no signs of an automated response system being used. Up-to-date
Social CRM requires ever shorter response times, and probably the only way to
achieve them is the implementation of automated or computer-aided SCRM processes.
However, we admit that such a system would be a great challenge [4] to program into
existing Enterprise Information Systems.
Lastly, we have shown that Facebook is a predominantly negative medium, in which it is
quite difficult to perform a positive and successful SCRM.
Further research could be conducted to determine the effectiveness of SCRM in
addressing diverse groups of Facebook users. Also, a qualitative content analysis of the
data we acquired would be interesting, albeit extremely time-consuming. In a further
paper, we intend to conduct both a quantitative and a qualitative analysis of the data we
have accumulated.
Acknowledgments. The paper was processed with the contribution of the long-term support of
scientific work at the Faculty of Informatics and Statistics, University of Economics, Prague, and
with the help of the students of the course 4SA526 New Media.
References
1. Babinet, G., Orsenna, E.: Big Data, penser l'homme et le monde autrement. Le Passeur
éditeur, Paris (2015)
2. Boehmova, L., Novak, R.: How employers use Linkedin for hiring employees in comparison
with job boards. In: Doucek, P., et al. (eds.) IDIMT-2015: Information Technology and
Society Interaction and Interdependence, pp. 189–194 (2015)
3. Columbus, L.: Gartner Predicts CRM Will Be A $36B Market By 2017. Forbes. http://www.
4. Doucek, P.: E-society - perspectives and risks for European integration. In: Chroust, G. (ed.)
IDIMT-2004, pp. 35–42. Universitatsverlag Rudolf Trauner, Linz (2004)
5. Dunaev, J., Stevens, R.: Seeking safe sex information: social media use, gossip, and sexual
health behavior among minority youth. J. Adolesc. Health 58(2), S93–S93 (2016)
6. Erdos, F., et al.: The Benefit of IT-investments: technological and cost-return-benefit
approach. In: Gerhard, C., et al. (eds.) IDIMT-2008. Universitatsverlag Rudolf Trauner, Linz
7. Facebook.com: Facebook Statistics. http://newsroom.fb.com/company-info/
8. Galtung, J., Ruge, M.H.: The structure of foreign news. J. Peace Res. 2(1), 64–91 (1965)
9. Pavlicek, A.: Social media - the good, the bad, the ugly. In: Doucek, P., et al. (eds.)
IDIMT-2013: Information Technology Human Values, Innovation and Economy, pp. 139–
149 (2013)
10. Sigmund, T.: Privacy in the information society: how to deal with its ambiguity? In: Doucek,
P. et al., (eds.) IDIMT-2014: Networking Societies - Cooperation and Conflict, pp. 191–201
11. Soltani, Z., Navimipour, N.J.: Customer relationship management mechanisms: a systematic
review of the state of the art literature and recommendations for future research. Comput.
Hum. Behav. 61, 667–688 (2016)
12. Sudzina, F.: Escapist motives for playing Facebook games: fine-tuning constructs. In:
Doucek, P., et al. (eds.) IDIMT-2013: Information Technology Human Values, Innovation
and Economy, pp. 151–158 (2013)
13. Vondra, Z.: Explanation of multimedia communication using CATWOE analysis. In: Petr,
D., et al. (eds.) IDIMT-2015: Information Technology and Society Interaction and
Interdependence, pp. 311–318 (2015)
An Approach to Discovery
of Customer Profiles
Ilona Pawełoszek and Jerzy Korczak
Częstochowa University of Technology, Częstochowa, Poland
[email protected]
Wrocław University of Economics, Wrocław, Poland
[email protected]
Abstract. The goal of the paper is to present the opportunity of exploiting data
analysis methods and semantic models to discover customer profiles from
financial databases. The solution to the problem is illustrated by the example of a
credit card promotion strategy based on historical data coming from the
bank's databases. The database contains personal data and transaction information. The idea is founded on data exploration methods and semantic models.
With this purpose in mind, multiple clustering and classification algorithms
were applied, the results of which were exploited to elaborate the ontology and
to define the customer profile to be used in decision-making.
Keywords: Customer profile · Semantic models · Data mining · Ontology of marketing
1 Introduction
In the age of personalization, one of the greatest challenges for marketers is eliciting
and communicating customer requirements. A customer profile — also known as a
customer persona — is a set of data describing a high-level abstraction model that
depicts the key characteristics of a group of consumers who could be interested in a
specific product. Personas are fictitious, specific, concrete representations of target
users [1].
Construction of customer profiles promotes overall internal alignment and coordination of marketing strategy with product development, which helps to drive down the
cost of promotion by reducing the number of useless marketing messages and ineffective contacts with customers. Although making use of customer feedback is an
established method of gathering marketing intelligence, interpreting data obtained by
traditional structured methods such as questionnaires, interviews and observation, is
often too complex or too cumbersome to apply in practice [1, 2].
Today, a lot of data can be gathered in an automatic way from transactional
systems. These data describe the features of customers (such as age, the place of
residence, number of family members, etc.) as well as behaviors (such as time and
purpose of transactions performed by the customer, amounts and frequency of
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 88–99, 2016.
DOI: 10.1007/978-3-319-49944-4_7
The discovery of useful knowledge from databases is one of the major functions of
analytical decision support systems [3, 4]. Various data mining algorithms can be used
to discover and describe the behavior patterns of individuals and to relate them to the
personal data collected in CRM databases.
Unfortunately, most of these systems lack semantics, which can be considered a key
component of business knowledge management. Marketing databases are today very
broad. The volume of information is so huge that analysis by classical database
methods is becoming increasingly difficult. In recent years, a few solutions have
appeared that improve the process of database exploration and the integration of
semantics in decision-making processes [5–11].
A few proposals for marketing ontologies have appeared in recent years. Barbu [12]
presents General Marketing Ontology, in which the main concepts originate from
sources such as Web marketing repositories and tools allowing for the semi-automatic
creation of ontology on the basis of documents important to managers of SMEs.
Saggion describes a project of integration in the BI system, founded on the module of
gathering information in an e-business domain using ontology and natural language
[13]. The documents describing profiles of companies are collected from different
sources, analyzed, grouped and annotated within the ontology. The semantic models
are also exploited in many search engines dedicated to marketing information, i.a.
Magpie, KIM, SemTag, On-To-Knowledge, Vision, h-TechSight [14–16]. The methods of data exploration have become a very popular tool used in marketing strategies
[17–20]. It is a well-known fact that the effectiveness of marketing activities depends to
a large degree on directing relevant advertisements to the right recipients. One of the
essential functions of customer relationship management is customer segmentation
which is the process of dividing the customer data set into distinct and internally
homogeneous groups in order to develop differentiated marketing strategies according
to their characteristics [21]. With this aim in mind, a segmentation of customers
consists in finding customers with similar preferences, needs, and behavior. A segment
corresponds to a target class of customers, for example travelers, i.e. people with
high income who travel often. In the case of large data sets, experiential classification
can be insufficient to distinguish segments. In such situations, data mining methods such
as clustering and decision trees can be helpful. The terms cluster and segment are
often used interchangeably, although they do not always mean the same thing.
Clusters are groups of customers with similarities found by a clustering algorithm.
Therefore, a cluster does not necessarily represent a segment; a semantic interpretation
of the clusters is required, and it might be difficult. It should be noted that in many cases a
segment can consist of a few clusters.
The segmentation of customers not only facilitates choosing the right product, but
also allows one to provide the customers with information that would be interesting to
them. The new knowledge obtained from the data exploration processes and presented
as an ontology on the one hand describes concepts and relations, and on the other facilitates
a manager's access to the important and useful information in the marketing
database [22].
Information on the place of the customer's residence is currently not enough for
product promotion purposes. The company should also know who the customer is,
what his or her characteristics are, what is valuable, what is attractive, and what is
completely uninteresting or even annoying. For example, to target advertising to
young people, it is better to prepare a marketing campaign on the Internet, but
Internet advertising may be less effective for a product targeted at older
people, who are usually less familiar with computers.
Marketing applications increasingly often use customers' psychographic profiles
[23, 24], developed on the basis of information that is continuously collected during
interactions between the company and its customers. Companies seek to register
every possible communication record of the customer (phone calls, system logins,
loyalty cards, mobile application usage, payments, transaction history, etc.).
Determining customers' needs in advance, with an individual approach to each
of them, is becoming the aim of the marketing activities of large organizations.
The structure of this paper is as follows. In Sect. 2, the problem of customer profile
discovery is defined. In the next section, the sources of data necessary to build classification and prediction models are presented, with particular emphasis on clustering
algorithms and classification trees. The last section presents the case study of data
exploration methods along with semantic models and their role in developing the
customers’ profiles in the context of promotion of payment cards.
2 The Problem of Developing the Customer’s Profile
Many approaches to building a marketing strategy are described in the literature
[17, 19, 25–27]. In this paper, in accordance with marketing theory literature, the
creation of customers’ profiles is emphasized as an essential tool of marketing strategy
implementation [28].
The focus of the project was on the discovery of the bank customer's profile, with
the aim of identifying potential recipients of dedicated payment cards. Generally, the
main source of information in research on a customer's profile is the history of
transactions on the customer's bank account. Usually, historical data are complemented
by personal information (such as age, gender, place of residence, and number of children),
which has an undeniable influence on the consumer's preferences and thus may constitute
a good basis for the customer's classification and for choosing a suitable payment card.
This information, its scope and quality, has a significant impact on the predictive
quality of the developed model.
Well-targeted promotion is a key determinant of the effectiveness and efficiency of
a marketing campaign. The initial choice of a product, made on the basis of information
about consumers’ preferences, can reduce the time of marketing phone calls, and
ensure the right choice of communication channels to reach the potential customers. It
reduces ineffective contacts with customers who might not be interested in a particular
offer. Finally, better communication influences the effective utilization of resources,
which should decrease the costs of promotion.
On the payment card market there is a large number of products dedicated to
various groups of customers, for example: students, people who travel often, seniors,
young parents, etc. To make these considerations more precise, let us assume that the
bank is going to propose payment cards to its customers corresponding to the following customer segments:
– “Travelers” – a credit card dedicated to frequent travelers, offering all kinds of
additional discounts on airline tickets, accommodation, hotels, and insurance.
– “Eternal students” – a credit card dedicated to the so-called “eternal students”,
offering extra discounts on concerts and events.
– “Still young” – a credit card dedicated to seniors, offering discounts in coffee bars,
additional health insurance, and free medical exams.
– “High heels” – a credit card dedicated to women, offering additional discounts in
shoe and clothing stores.
– “Business card” – a card dedicated to business people, offering additional discounts
in elegant clothing stores or on Business Class tickets.
The five segments have been briefly characterized by marketing analysts on the
basis of their professional experience. Therefore the goal of the project was to improve
the scope and specification of segments using data mining methods.
The specificity of the offered cards has a decisive influence on the selection of data
for analysis. For example, the “Travelers” card should be interesting to young or
middle-aged people who are financially well-off. Their transaction history is dominated
by multiple expenditures for touristic purposes, such as tickets, travel agency services,
and expenditures and mobile top-ups made from abroad. The “Eternal students” card
should be addressed to young people without children (their profiles
are characterized by, i.a., low expenditures on articles for children). Considering the
discounts offered for various events, this card can be proposed to people whose
transaction history reveals a party lifestyle. The “Still young” card should be offered to
older people who have significant expenditures on healthcare services. The “High
heels” card should be interesting mainly to middle-aged women. Important information for choosing this type of card will be whether the customer has children and
therefore increased expenditures on children's articles and healthcare services. The
potential customers for the “Business card” are middle-aged people who are settled down and
have substantial income.
3 Description of Data
In the project, the data about customers was extracted from transaction-oriented systems and mobile banking applications. The experimental data file contains 200,000 anonymized customer records describing personal data, the bank products at the customers’ disposal, incomes, expenditures, and financial transactions. The data underwent statistical analysis and initial transformation. More detailed information about the data can be found in [22].
I. Pawełoszek and J. Korczak
Taking into consideration the selected data exploration algorithms, the data were converted to numerical values of continuous, discrete, or binary type. Attributes with very low variance, as well as redundant ones, were eliminated. Missing values were completed using the nearest-neighbor algorithm [20, 29]. The file prepared in this way was then normalized, which was necessary for the heterogeneous variables. As a result of the initial transformations and feature reduction, a set of 200,000 observations with 24 variables in the interval [0, 1] was passed to the further phases of data exploration.
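The preprocessing steps above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' actual pipeline; the toy attributes (age, monthly income) and thresholds are assumptions.

```python
import math

def low_variance_cols(rows, threshold=1e-3):
    """Indices of columns whose variance exceeds the threshold
    (computed over the non-missing values)."""
    keep = []
    for i, col in enumerate(zip(*rows)):
        vals = [v for v in col if v is not None]
        mean = sum(vals) / len(vals)
        if sum((v - mean) ** 2 for v in vals) / len(vals) > threshold:
            keep.append(i)
    return keep

def knn_impute(rows, k=3):
    """Fill missing values (None) with the mean of the k nearest
    complete rows, measured on the attributes that are present."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        dims = [i for i, v in enumerate(r) if v is not None]
        nbrs = sorted(complete, key=lambda c: math.dist(
            [r[i] for i in dims], [c[i] for i in dims]))[:k]
        out.append([v if v is not None else sum(c[i] for c in nbrs) / k
                    for i, v in enumerate(r)])
    return out

def minmax(rows):
    """Scale every column to the interval [0, 1]."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

# Toy records (age, monthly income); one income value is missing.
raw = [[25, 2000.0], [40, None], [60, 8000.0], [35, 3000.0]]
imputed = knn_impute(raw, k=2)   # the missing income becomes a 2-NN mean
scaled = minmax(imputed)         # all values now lie in [0, 1]
```

On the real file, these three operations correspond to the feature reduction, nearest-neighbor completion, and normalization described above.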
4 Process of Determining Customer Profile
The study assumes that the bank intends to expand its offer with one of five dedicated payment cards. The promotion is intended to avoid mis-selling, i.e., the incorrect identification of the targeted customer groups.
Generally speaking, the primary idea of the solution was to divide the customer database into semantically interpretable clusters, and then to design a classifier that permits an accurate definition of the customer profiles. In the final exploration task, the results are used to build an ontology of knowledge about the customers and to apply it in marketing decision making [17].
For clustering, four algorithms were chosen: k-means, SOM neural networks, hierarchical clustering, and user-supervised clustering [29, 30]. The partitions obtained by the first three algorithms were very similar. To some degree, the results revealed not only clusters imperceptible to traditional marketing methods, but also previously unnoticed relationships between customer characteristics and the offered products.
Taking the distance between clusters and the semantics of the data as the aggregation criterion, five clusters corresponding to the five potential market segments were generated. Although verified by specialists, this partition did not show clusters of good quality in the marketing sense. Further runs of k-means and hierarchical clustering produced groups of customers more homogeneous in terms of semantics, but still not satisfactory. Therefore, a new clustering method, guided by the marketing specialist, was proposed.

Fig. 1. Diagram of the data mining process
Clustering with user engagement was based on the preliminary elimination of attributes irrelevant to the specificity of the clusters sought. This operation made it possible to determine subsets with a high likelihood of containing semantically interpretable clusters. For example, one can limit the search space for the clusters associated with the “High heels” segment by eliminating irrelevant attributes such as the number of fuel, travel, and healthcare transactions, the number of children, and mobile top-ups. The most relevant attributes were gender (woman), age (20 to 60 years), and monthly income (above 1,500 PLN).
Within these constraints, the data file of 200 thousand instances was reduced to approximately 60 thousand records of women who fulfilled these restrictions.
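The restriction of the search space can be expressed as a simple filter over customer records; a minimal sketch, with hypothetical field names rather than the bank's actual schema:

```python
# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"gender": "F", "age": 34, "monthly_income": 2400},
    {"gender": "M", "age": 45, "monthly_income": 9000},
    {"gender": "F", "age": 67, "monthly_income": 1800},
    {"gender": "F", "age": 29, "monthly_income": 1200},
]

def high_heels_search_space(c):
    """Constraints used to narrow the clustering search space:
    women aged 20-60 with a monthly income above 1,500 PLN."""
    return (c["gender"] == "F"
            and 20 <= c["age"] <= 60
            and c["monthly_income"] > 1500)

candidates = [c for c in customers if high_heels_search_space(c)]
```

Applied to the full file, a filter of this shape is what reduced the 200 thousand instances to the roughly 60 thousand retained for clustering.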
Figure 1 shows a diagram of the data mining process implemented on the Orange¹ platform.
In order to interpret the obtained clusters, four classifiers were applied: inductive
decision trees, Multi-Layer Perceptron, Naive Bayes and CN2 [20, 32].
Compared with the previous publication [22], in this paper the research was extended to all classes of customers.
During the analysis of the clusters, it turned out that the women to whom the “High heels” card can be addressed were dispersed into five clusters: C1, C2, C3, C4, and C6. Of these, the marketing analysts were particularly interested in the clusters with the highest number of instances, C6 and C1, whose initial definitions were extracted from the decision tree (Fig. 2), namely:
C1 ∪ C6: ((Gender = true) and (Amount_Entertainment_Transactions <= 25,951) and (((CreditCard = true) and (Village = true)) or ((CreditCard = false) and (Village = false))))
In the process of rule validation, the analysts rejected the condition concerning Amount_Entertainment_Transactions and replaced it with the condition (EntertainmentTransactions = true), indicating women who have made at least one entertainment transaction, regardless of the amount.
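The decision-tree rule for C1 ∪ C6 and its analyst-validated form can be written as Python predicates. The attribute names are hypothetical simplifications of those in the paper, the grouping (credit card and village both true, or both false) is the assumed reading of the printed expression, and 25,951 is the threshold as printed.

```python
def c1_c6_initial(c):
    """Rule extracted from the decision tree for clusters C1 and C6."""
    return (c["gender_female"]
            and c["amount_entertainment"] <= 25951
            and ((c["credit_card"] and c["village"])
                 or (not c["credit_card"] and not c["village"])))

def c1_c6_validated(c):
    """After analyst validation: the amount threshold is replaced by
    'at least one entertainment transaction, regardless of amount'."""
    return (c["gender_female"]
            and c["amount_entertainment"] > 0
            and ((c["credit_card"] and c["village"])
                 or (not c["credit_card"] and not c["village"])))

# A high-spending customer rejected by the initial rule
# but accepted by the validated one.
c = {"gender_female": True, "amount_entertainment": 30000,
     "credit_card": True, "village": True}
```

The pair of predicates makes the effect of the analysts' change explicit: only the entertainment-amount conjunct differs.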
Legend (Fig. 2): the values in the nodes refer to the cluster ID, the degree of homogeneity of the cluster (in %), and the number of instances. The attribute names in the nodes have been shortened for readability.
Finally, after all transformations, the rules describing the considered classes were the following:
• Class “High heels” if ((Gender women) and (EntertainmentTransactions true))
Note that the new class definition differs from the one specified in the suggested restrictions on age and income. The final definition of potential customers for the “High heels” card is given in Fig. 4.
¹ Orange is an open-source data visualization and analysis package for data mining applications, developed by the Bioinformatics Laboratory at the University of Ljubljana, Slovenia (http://orange.biolab.si) [31].
Fig. 2. Fragment of a decision tree for clusters C1 and C6
• Class “Travellers” – if ((AmountRailwayTrans >= 1,000) or (AmountPetrolTrans >= 1,000))
The class describes people who travel a lot, which is why both the railway and petrol transaction amounts are high; these customers use both train and car transport. The new sample of data marked as “Travellers” consists of 1,175 persons.
• Class “Business Card” – if ((Age <= 60) and (MonthlyIncome >= 7,000))
To choose the right potential customers for the “Business Card”, we searched for women and men with a higher-than-average income, which by our assumption means more than 7,000 PLN. They are usually middle-aged and older persons, but the maximum age was assumed to be 60 years. The new sample of data, marked as the “Business Cards” class, consists of 500 customers; it is the smallest class, with an average income of 10,209 PLN.
• Class “Eternal students” – if ((Age <= 40) and (AmountOfTransEntertainment >= 1,000))
People who could be interested in the “Eternal students” offer are young people who spend more than 1,000 PLN on entertainment transactions. The number of items in this class is 3,762.
• Class “Still young” – if ((Age >= 55) and (AmountOfTransHealth >= 1,000))
The class describes older people who spend significant amounts on health transactions. The number of customers who could be interested in this kind of payment card is 1,071.
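The final class rules can be collected into a small rule base that, given a customer record, returns every card profile the customer satisfies; this sketches the profile lookup that the ontology later provides (the field names are illustrative, not the bank's schema):

```python
# Final class rules from the paper, encoded as predicates
# over a customer record with illustrative field names.
RULES = {
    "High heels":
        lambda c: c["gender"] == "F" and c["entertainment_trans"] > 0,
    "Travellers":
        lambda c: c["amount_railway"] >= 1000 or c["amount_petrol"] >= 1000,
    "Business Card":
        lambda c: c["age"] <= 60 and c["monthly_income"] >= 7000,
    "Eternal students":
        lambda c: c["age"] <= 40 and c["amount_entertainment"] >= 1000,
    "Still young":
        lambda c: c["age"] >= 55 and c["amount_health"] >= 1000,
}

def matching_cards(customer):
    """All card profiles whose definition the customer satisfies;
    a customer may belong to several segments at once."""
    return [name for name, rule in RULES.items() if rule(customer)]

# Hypothetical customer matching two segments.
anna = {"gender": "F", "age": 28, "monthly_income": 3500,
        "entertainment_trans": 4, "amount_entertainment": 1400,
        "amount_railway": 200, "amount_petrol": 0, "amount_health": 0}
```

Note that the segments overlap by design, which is exactly why the ontology later records each profile as a separate class rather than a partition.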
After the completion of the data exploration, the construction of the customer ontology started using the Protégé platform².
² Protégé is an ontology development platform supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health (http://wwwmed.stanford.edu).
The schema of part of the ontology is shown in Fig. 3. Classes were a priori posed as the groups of customers interested
in specific payment cards, indicated in the previous section. Rules and discriminant
attributes of the decision tree designated the characteristics of the concepts in the
ontology, through the so-called Data Properties, and made it possible to define the
customer groups (Fig. 4).
Fig. 3. Diagram of the ontology of customer market segmentation
The ontology thus created, containing knowledge of the marketing profiles of customers, was used not only for information retrieval from the database, but also for the classification of new bank customers with respect to the specificity of the products offered.
5 Using the Semantic Model of Customer
The example shows the utility of ontologies in making marketing decisions related to preparing the offer for the bank’s customers, together with an explanation of why a particular card should be offered. It was assumed that the manager does not have complete knowledge of business informatics and therefore knows neither SQL programming, nor the database schema, nor the complex functionalities of data mining platforms. Protégé was used to define the semantic model in our case study. The financial data was provided in the form of a Microsoft Excel spreadsheet. To enable ontology content collection, the Cellfie plugin³ was used; this desktop editor maps and annotates the data in the spreadsheet (columns and rows) to attributes within the financial ontology. The class descriptions discovered by the data mining algorithms were transformed into a set of design patterns. Using knowledge about design patterns improves database consistency and reduces the amount of time required to validate the data.
The analyst uses only the graphical interface of Protégé, which allows him to consult the ontology of knowledge about the bank’s customers (partially shown in Fig. 3) and facilitates access to the marketing database. Figure 4 shows an example of a class description of potential customers for the “High heels” card. The diagram shows, on the left, the classes of customer cards; on the right, the logical expression defining the class; and, at the bottom, the list of customers who meet these requirements. Based on this definition, the manager can not only search the database for all customers who meet the class constraints, but also strengthen or weaken the conditions of the class description. It should be noted that these operations do not require the manager to know the database structure or a query language.
Fig. 4. Definition and instances of the “High heels” class
³ Cellfie is a Protégé Desktop plugin for mapping spreadsheets to OWL ontologies, available at https://
Generally, the data mining methods make it possible to retrieve different clusters or groups of customers together with their marketing interpretation. The interpretation, written in the form of formal expressions in Protégé, enables the manager to access detailed information about the customers, about groups of customers, dependencies, and all the axioms
of marketing knowledge. Concepts, as well as the relationships between them, can be easily changed and updated as a result of acquiring new knowledge, new data, or new managerial experience.
6 Summary and Directions for Further Research
The paper has presented an analysis of the marketing database containing personal data
as well as transactional and financial information. It was shown how one can construct
customer profiles through the use of data mining methods. For this purpose, a number
of algorithms for clustering and classification were applied. Note that the database
schemas are usually designed for efficient data storage, but do not provide a semantic
description to facilitate understanding of the data, interpretation, and reasoning.
Therefore, the results of data mining were used to construct an ontology that describes the customer profiles.
The use of the ontology with the easy-to-learn visual interface of the Protégé platform allows managers to learn more about the information contained in the databases; it
provides clear definitions related to the attribute names in the database. In addition, the
proposed ontology contains the pre-defined classes that automatically make it possible
to extract the customers’ data which conform to particular customer profiles. For large data sets (in this case 200,000 instances), data mining methods supported by the ontology significantly simplified the data analysis process. The developed solution can be easily customized to suit the analyst’s needs - for example, when one needs to offer a
new product, or to define a new class of customer profiles. To extract customer data for
a specific marketing campaign, it is enough to enter the profile name (e.g., High heels)
to get a list of persons who meet certain criteria.
An important direction for further research is therefore to improve the application interface by facilitating and monitoring the data mining process, with special emphasis on the needs and skills of business analysts.
Acknowledgments. The authors would like to thank the staff of the VSoft company, Krakow, for providing the Pathfinder package, data, and documentation. The research was carried out as part of project No. POIG.01.04.00-12-106/12 - “Developing an innovative integrated platform for the financial area”, referred to as the Project, co-financed by the European Regional Development Fund and the Innovative Economy Operational Programme 2007–2013. Special thanks to Ghislain Atemezing of Mondeca, Paris, for his helpful suggestions concerning the usage of the Cellfie plugin and comments on a draft of this paper.
References
1. Wyse, S.E.: Advantages and disadvantages of face-to-face data collection. Snap Surveys, 15 October 2014. http://www.snapsurveys.com/blog/advantages-disadvantages-facetoface-datacollection/. Accessed 25 July 2016
2. Pruitt, J., Adlin, T.: The Persona Lifecycle: Keeping People in Mind Throughout Product
Design. Elsevier, Boston (2006)
3. Montgomery, D.B.: Marketing Information Systems: An Emerging View. Forgotten Books,
London (2012)
4. Stair, R.M., Reynolds, G.: Principles of Information Systems. Cengage Learning, Boston
5. Nogueira, B.M., Santos, T.R.A., Zarate, L.E.: Comparison of classifiers’ efficiency on
missing values recovering: application in a marketing database with massive missing data.
In: Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data
Mining, CIDM, pp. 66–72 (2007)
6. Cao, L.B., Zhang, C.Q., Liu, J.: Ontology-based integration of business intelligence, web
intelligence and agent systems. Int. J. 4(3), 313–325 (2006). IOS Press
7. Matsatsinis, N.F., Siskos, Y.: Intelligent Support Systems for Marketing Decisions. Springer
Science & Business Media, New York (2003)
8. Grassl, W.: The reality of brands: towards an ontology of marketing. Am. J. Econ. Sociol. 58
(2), 313–319 (1999)
9. Pinto, F., Alzira, M., Santos, M.F.: Ontology-supported database marketing. J. Database
Market. Cust. Strategy Manage. 16, 76–91 (2009)
10. Pinto, F., Alzira, M., Santos, M.F.: Ontology based data mining – a contribution to business
intelligence. In: 10th WSEAS International Conference on Mathematics and Computers in
Business and Economics (MCBE 2009), Czech Republic, 23–25 March (2009)
11. Zhou, X., Geller, J., Perl, Y., Halper, M.: An application intersection marketing ontology. In: Goldreich, O., Rosenberg, A.L., Selman, A.L. (eds.) LNCS, vol. 3895, pp. 143–163. Springer, Heidelberg (2006). doi:10.1007/11685654_6
12. Barbu, E.: An ontology-based system for the marketing information management (2006).
13. Saggion, H., Funk, A., Maynard, D., Bontcheva, K.: Ontology-based information extraction
for business intelligence. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I.,
Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G.,
Cudré-Mauroux, P. (eds.) ASWC/ISWC -2007. LNCS, vol. 4825, pp. 843–856. Springer,
Heidelberg (2007). doi:10.1007/978-3-540-76298-0_61
14. Bouquet, P., Dona A., Serafini L., Zanobini S.: ConTeXtualized local ontology specification
via CTXML. In: Bouquet, P. Harmelen, F., Giunchiglia, F., McGuinness, D., Warglien, M.
(eds.) MeaN-02 AAAI Workshop on Meaning Negotiation, Edmonton, Alberta, Canada
15. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R.V.: Semtag and seeker: boot-strapping the
semantic web via automated semantic annotation. In: Proceedings of the Twelfth
International WWW Conference (2003)
16. Domingue, J., Dzbor, M., Motta, E.: Magpie: supporting browsing and navigation on the
semantic web. In: Nunes, N., Rich, C. (eds.) Proceedings of ACM Conference on Intelligent
User Interfaces (IUI), pp. 191–197 (2004)
17. Linoff, G.S., Berry, M.J.A.: Data Mining Techniques: for Marketing, Sales, and Customer
Relationship Management. Wiley, New York (2011)
18. Ohsawa, Y., Yada, K.: Data Mining for Design and Marketing. Chapman and Hall/CRC,
Boca Raton (2009)
19. Poh, H.L., Yao, J., Jasic, T.: Neural networks for the analysis and forecasting of advertising
and promotion impact. Int. Syst. Account. Financ. Manag. 7(4), 253–268 (1998). doi:10.
20. Witten, I.H.: Data Mining: Practical Machine Learning Tools and Techniques. Data Management Systems. Morgan Kaufmann, San Francisco (2011)
21. Tsiptsis, K., Chorianopoulos, A.: Data Mining Techniques in CRM: Inside Customer
Segmentation. Wiley, New York (2009)
22. Pawełoszek, I., Korczak, J.: From data exploration to semantic model of customer. Submitted for publication in ACM Transactions on Knowledge Discovery from Data (2016)
23. Vyncke, P.: Lifestyle segmentation: From attitudes, interests and opinions, to values,
aesthetic styles, life visions and media preferences. Eur. J. Commun. 17, 445–463 (2002)
24. Kahle, L.R., Chiagouris, L. (eds.): Values, Lifestyles, and Psychographics. Psychology
Press, New York, London (2014)
25. Armstrong, J.S., Brodie, R.J.: Forecasting for Marketing. In: Hooley, G.J., Hussey, M.K.
(eds.) Quantitative Methods in Marketing, pp. 92–119. International Thompson Business
Press, London (1999). http://forecastingprinciples.com/files/pdf/Forecasting%20for%
20Marketing.pdf. Accessed 17 Dec 2015
26. Chattopadhyay, M., Dan, P.K., Majumdar, S., Chakraborty, P.S.: Application of artificial
neural network in market segmentation: a review on recent trends. Manag. Sci. Lett. 2, 425–
438 (2012). http://arxiv.org/ftp/arxiv/papers/1202/1202.2445.pdf. Accessed 17 Dec 2015
27. Yao, J., Teng, N., Poh, H.L.: Forecasting and analysis of marketing data using neural
network. J. Inf. Sci. Eng. 14(4), 523–545 (1998). http://www2.cs.uregina.ca/~jtyao/Papers/marketing_jisi.pdf. Accessed 17 Dec 2015
28. Prymon, M.: Marketingowe strategie wartości na rynkach globalnych. Wydawnictwo UE we
Wrocławiu, Wrocław (2010)
29. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. Data Management Systems, 3rd edn. Morgan Kaufmann, San Francisco (2012)
30. Aggarwal, C.C., Reddy, C.K.: Data Clustering: Algorithms and Applications. Data Mining
and Knowledge Discovery Series. Chapman & Hall/CRC, Boca Raton (2013)
31. Demsar, J., et al.: Orange: data mining toolbox in python. J. Mach. Learn. Res. 14(Aug),
2349–2353 (2013)
32. Larose, D.T., Larose, C.D.: Data Mining and Predictive Analytics. Methods and
Applications in Data Mining, 2nd edn. Wiley, New York (2015)
Security and Privacy Issues
Cyber Security Awareness and Its Impact
on Employee’s Behavior
Ling Li(&), Li Xu, Wu He, Yong Chen, and Hong Chen
Old Dominion University, 5115 Hampton Blvd, Norfolk, USA
Abstract. This paper proposes a model that extends the Protection Motivation
Theory to validate the relationships among peer behavior, cue to action, and
employees’ action experience of cyber security, threat perception, response
perception, and employee’s cyber security behavior. The findings of the study
suggest that the influence from peer behavior and employees action experience
of cyber security is an important factor for improving cyber security behavior in
organizations. Peer behavior positively affects cue to action, which positively
impacts employees’ action experience. Employees’ action experience then
would have positive impacts on their threat perception and response perception.
As a result, employees’ threat perception and response perception are positively
related to their cyber security behavior. This process is a chain reaction.
Keywords: Cyber security awareness · Employee cyber security behavior
1 Introduction
Recent cyber security breaches have prompted many organizations to take appropriate measures to secure their databases and businesses and to develop effective cyber security policies. The top five cyber security threats identified by a Sungard Availability Services survey [1] in 2014 are vulnerable web applications, overall security awareness, out-of-date security patches, failure to encrypt PCs and sensitive data, and obvious or missing passwords. Among these threats, security awareness was ranked the second most important cyber security issue and was noted by 51% of respondents. Therefore, designing and implementing security awareness programs, such as cyber security policy enforcement [2–4], mandated trainings [3, 5, 6], security communication and computer monitoring [6], and top management commitment [6], is essential to improving cyber security.
2 Background and Hypotheses
This paper proposes a model integrating the protection motivation theory (PMT) and the Health Belief Model (HBM) to test cyber security awareness and its impact on employees’ behavior. Figure 1 shows the relationships among peer behavior,
cue to action, employees’ action experience of cyber security, threat perception (perceived severity, perceived vulnerability and perceived barriers), response perception
(response efficacy and self-efficacy), and cyber security behavior.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 103–111, 2016.
DOI: 10.1007/978-3-319-49944-4_8
L. Li et al.
Prior research has explored the reasons why security awareness programs are not
effective. Specifically, Herath and Rao [7] developed and tested a theoretical model of
the incentive effects of penalties, pressures and perceived effectiveness of employee
actions. They found that employees’ cyber security behaviors were influenced by
intrinsic and extrinsic motivators. Ng and Xu [8] adopted the Health Belief Model (HBM) in a user security study and found that users’ perceived susceptibility, perceived benefits, and self-efficacy determine their security behavior. A number of
published studies adopt the protection motivation theory (PMT) to investigate how
employees’ threat perception and response perception regarding cyber security impact
their compliance behaviors (e.g. [9–13]).
However, the findings reported by these studies are inconsistent. For example, Ng and Xu [8] find that individuals exposed to higher levels of cues to action do not exhibit a higher level of cyber security behavior than others, whereas Johnston and Warkentin [10] find that social influence has a positive effect on individuals’ intention to adopt cyber security actions. Individuals’ perceived severity of cyber-attacks has been found to have positive impacts [9, 13, 14], negative impacts [8], or even no impact [15] on their intention to comply with cyber security policies. Similarly, individuals’ perceived vulnerability to cyber-attacks has been found to influence their intention to comply with cyber security policies both positively [14] and negatively [13]. Furthermore, individuals’ response efficacy regarding cyber-attacks has likewise been found to affect their intention to comply with cyber security policies both positively [9, 10] and negatively [13, 14].
We intend to provide a clearer picture of employee cyber security behavior by
proposing a model (Fig. 1) that integrates the protection motivation theory (PMT) and
the Health Belief Model (HBM) to validate the relationships among peer behavior, cue
to action, employees’ action experience of cyber security, their threat perception
(perceived severity, perceived vulnerability, and perceived barriers) and response
perception (response efficacy and self-efficacy), and their cyber security behavior.
A number of hypotheses based on Fig. 1 have been developed.
Fig. 1. Conceptual Model
Hypothesis 1. Peer behavior is positively associated with cues to action for
employees’ cyber security behaviors.
Hypothesis 2. Cues to action positively affect employees’ action experience of
cyber security.
Hypothesis 3a. Employees’ action experience positively affects their perceived
severity of cyber security incidents.
Hypothesis 3b. Employees’ action experience positively affects their perceived
vulnerability caused by cyber security incidents.
Hypothesis 3c. Employees’ action experience negatively affects their perceived
barriers about cyber security incidents.
Hypothesis 3d. Employees’ action experience positively affects their response efficacy about cyber security incidents.
Hypothesis 3e. Employees’ action experience positively affects their self-efficacy
about cyber security incidents.
Hypothesis 4a. Employees’ perceived severity positively affects their self-reported
cyber security behavior.
Hypothesis 4b. Employees’ perceived vulnerability positively affects their
self-reported cyber security behavior.
Hypothesis 4c. Employees’ perceived barriers negatively affect their self-reported
cyber security behavior.
Hypothesis 4d. Employees’ response efficacy positively affects their self-reported
cyber security behavior.
Hypothesis 4e. Employees’ self-efficacy positively affects their self-reported cyber
security behavior.
3 Research Method
The empirical data was collected using a survey questionnaire in the US in 2015. The sample size in this study is 579. The socio-demographic characteristics are reported in
Table 1. About 35% of the respondents are male and 65% are female. Among the
participants, 68.58% are under 30 years old. Respondents are from diverse industries.
When they were asked whether their company had an explicit cyber security policy,
about 46% of the participants answered “yes”, 14.68% answered “no”, and a little over a
third of the participants (39.21%) said that they knew nothing about their company’s
information security policy. Variables about behavior and belief are assessed via a
seven-point Likert scale, ranging from strongly disagree (1) to strongly agree (7).
Structural equation modeling (SEM) was applied to explore the relationships among the constructs in the conceptual model. SEM follows a two-step approach
that includes constructing the measurement model and testing the structural model.
Specifically, we test the proposed model and assess the overall fit using the maximum
likelihood method in Amos.
Nine latent constructs and their observed variables are measured in the proposed
model. Most of the measurements in this study were tested in previous studies. To assess the reflective constructs in our measurement model, we examined construct reliability and validity, convergent validity, and discriminant validity. First, we conducted
principal component analysis to identify and to confirm the different factors under each
construct in our model. Specifically, we ran exploratory factor analysis (EFA) and
confirmatory factor analysis (CFA) in SPSS. EFA using principal-component factor
analysis with Varimax rotation was performed to examine the factor solution among
the nine factors in the study. The results reveal that the nine factors have eigenvalues
greater than 1. Next, CFA is conducted to confirm the factors under each latent variable. The results of CFA are shown in Table 2.
The results of the CFA confirm the significance of all paths between the observed variables and the first-order latent variables at the p < 0.001 significance level. The construct validity of our model is assessed through the percentage of variance extracted [16].
The total variance explained by each construct is in the range of 53–73% (see Table 2).
Reliability for the constructs is assessed via Cronbach’s alpha. The reliability for all
Table 1. Socio-demographic characteristics (frequency and percent of respondents by age group, industry sector, and security policy awareness)
Table 2. Results of factor analysis: for each construct (action experience, perceived vulnerability, perceived severity, perceived barriers, response efficacy, cues to action, security self-efficacy, peer behavior, and self-reported security behavior), the table reports the item loadings (all significant at p < 0.001), standard errors, R², total variance explained (53–73%), Cronbach’s alpha (0.71–0.88), and AVE.
constructs is considered acceptable [17], because all the values are greater than the 0.70 threshold (Table 2). Hence, we claim that both the construct validity and the construct reliability of our model are satisfactory.
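The Cronbach's alpha used above can be computed directly from the item responses; a minimal sketch, assuming toy seven-point Likert data rather than the actual survey items:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for one scale; `items` is a list of columns,
    one per questionnaire item, all over the same respondents."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]  # per-respondent sums

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Three hypothetical 7-point Likert items answered by five respondents.
likert = [[5, 6, 4, 7, 5],
          [4, 6, 5, 7, 4],
          [5, 7, 4, 6, 5]]
alpha = cronbach_alpha(likert)
```

With the 0.70 acceptability threshold used in the paper, a value of `alpha` above 0.70 would lead the scale to be retained.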
Convergent validity assesses consistency across multiple items. It is shown when
the indicators load much higher on their hypothesized factor than on other factors (i.e.,
own loadings are higher than cross loadings). Items that do not exceed the threshold
will be dropped from the construct list. For our model, all estimated standard loadings are significant at the p < 0.001 level [18] with acceptable magnitude (>0.50; the ideal level is >0.70) [19], except SCB3. The results indicate that the measurements in our model have good convergent validity.
The fit statistics of the structural model are reported in Table 3. The fit indices chosen for our model represent two characteristics: global fit measures and comparative fit measures. The chi-square test (χ²) with degrees of freedom is commonly used as the global model fit criterion. The chi-square statistic must, however, be interpreted with caution, especially for a large sample size, because even a trivial discrepancy may become statistically significant and lead to rejection of the hypothesized model. We choose the comparative fit index (CFI), goodness-of-fit index (GFI), incremental fit index (IFI), and root mean square error of approximation (RMSEA) to assess the congruence between the hypothesized model and the data.
The goodness-of-fit indices for the specified model are displayed in Table 3. The χ² value for the structural equation model is 1882 (DF = 582), giving a χ²/DF ratio of 3.23. The comparative fit index (CFI) is 0.87, the goodness-of-fit index (GFI) is 0.84, and the incremental fit index (IFI) is 0.87. All the values are close to the generally accepted minimum norm of 0.90 for satisfactory fit.
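The reported fit values can be checked against common rules of thumb (χ²/DF below 3–5, comparative indices at or above 0.90, RMSEA at or below about 0.08); a small sketch using the figures from the text, with the thresholds as conventional assumptions rather than the authors' own criteria:

```python
def fit_summary(chi2, df, cfi, gfi, ifi, rmsea):
    """Summarize common SEM fit heuristics: chi2/df ratio,
    comparative indices near 0.90, RMSEA below about 0.08."""
    return {
        "chi2/df": round(chi2 / df, 2),
        "cfi_ok": cfi >= 0.90,
        "gfi_ok": gfi >= 0.90,
        "ifi_ok": ifi >= 0.90,
        "rmsea_ok": rmsea <= 0.08,
    }

# Values reported in the paper.
summary = fit_summary(chi2=1882, df=582,
                      cfi=0.87, gfi=0.84, ifi=0.87, rmsea=0.062)
```

Under these heuristics, the RMSEA passes while the comparative indices fall just short of 0.90, matching the paper's "close to the accepted norms" reading.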
The test of the structural model includes estimating the path coefficients, which
indicate the strength of the relationships between the independent and dependent variables, and the R2 values, which are the amount of variance explained by the independent
variables. The full set of relationships for the structural model is provided in Table 4.
The hypotheses in our structural model test the relationships among peer behavior,
cue to action, employees’ action experience of cyber security, threat perception (perceived severity, perceived vulnerability, and perceived barriers), response perception
(response efficacy and self-efficacy), and their cyber security behavior. The results of
our study support 11 out of 12 hypotheses that have been developed based on the
conceptual model in Fig. 1. Hypothesis 4a (Employees’ perceived severity positively
affects their self-reported cyber security behavior) is the only one that is not supported.
Table 4 summarizes the hypothesis test results for the structural model.
Cyber Security Awareness and Its Impact on Employee’s Behavior
Table 3. Fit statistics for structural model
Model goodness of fit statistics                    Model value
Root mean square error of approximation (RMSEA)     0.062
Comparative fit index (CFI)                         0.87
Goodness-of-fit index (GFI)                         0.84
Incremental fit index (IFI)                         0.87
4 Discussions
This paper proposes a model that integrates the protection motivation theory and the
Health Belief Model to validate the relationships among peer behavior, cue to action,
employees’ action experience of cyber security, threat perception (perceived severity,
perceived vulnerability, and perceived barriers), response perception (response efficacy
and self-efficacy), and their self-reported cyber security behavior. The results confirm
that (a) peer behavior is a significant factor in enhancing the cue to action for
employee’s behavior towards cyber security; (b) cue to action significantly influences
employees’ action experience related to cyber security; (c) employees’ action experience of cyber security positively affects their perceived severity, perceived vulnerability, response efficacy, and security self-efficacy but negatively affects their perceived
barriers; (d) employees’ perceived severity, perceived vulnerability, response efficacy,
Table 4. Summary of hypotheses test result for the structural model
Hypothesis  Standard path                                              Result
H1          Peer behavior → Cue to action                              Supported
H2          Cue to action → Action experience                          Supported
H3a         Action experience → Perceived severity                    Supported
H3b         Action experience → Perceived vulnerability               Supported
H3c         Action experience → Perceived barriers                    Supported
H3d         Action experience → Response efficacy                     Supported
H3e         Action experience → Security self-efficacy                Supported
H4a         Perceived severity → Self-reported security behavior      Not supported
H4b         Perceived vulnerability → Self-reported security behavior Supported
H4c         Perceived barriers → Self-reported security behavior      Supported
H4d         Response efficacy → Self-reported security behavior       Supported
H4e         Security self-efficacy → Self-reported security behavior  Supported
L. Li et al.
and self-efficacy positively impact their self-reported security behavior, while employees' perceived barriers negatively impact their self-reported security behavior. These findings concur with the results of previous research regarding the factors that influence employees' cyber security behavior in the workplace [8–10, 12, 14].
This study explores self-reported cyber security behavior to measure employees’
cyber security activities; this approach is different from prior cyber security studies that
used behavioral intention or likelihood of behavior as their dependent variables. Our
measurement reflects employees’ actual behavior, not their intentions. Therefore, the
results achieved in this study are more convincing.
The results of this study reveal that the influence of peer behavior and of employees' own action experience of cyber security are important factors for improving cyber security in organizations. Peer behavior positively affects cue to action, which positively impacts employees' action experience (H1 and H2). Employees' action experience in turn has positive impacts on their threat perception and response perception (H3a, H3b, H3d, and H3e). As a result, employees' threat perception and response perception positively affect their cyber security behavior (H4a, H4b, H4d, and H4e). This process is a chain reaction.
5 Conclusions
Based on the findings of the study, we suggest that organizations consider developing a system of rewards to create a pro-security internal atmosphere. In particular, employees who follow cyber security regulations and rules should be encouraged. In this way, employees can get clear cues from their peers about taking cyber security action. Meanwhile, organizations should promote the sharing of experience in mitigating cyber security risks and reducing cyber security threats. This could be realized through effective training programs.
This study has limitations that should be taken into account. Future research needs to compare the results of self-reported behavior with behavioral intention/likelihood of behavior. Future research may also analyze the moderating effects of cyber security policy awareness level, industry, employee age, and other factors with other statistical tools. Moreover, future research should explore the underlying causes of the moderating effect of gender and examine the effect using empirical tests.
Acknowledgements. This work was supported by the National Science Foundation of the U.S.
under [Grant Number 1318470].
References
1. DeMetz, A.: The #1 cyber security threat to information systems today (2015). http://www.
2. Chen, Y., He, W.: Security risks and protection in online learning: a survey. Int. Rev. Res.
Open Distrib. Learn. 14(5), 1–20 (2013)
3. D’Arcy, J., Hovav, A., Galletta, D.: User awareness of security countermeasures and its
impact on information systems misuse: a deterrence approach. Inf. Syst. Res. 20(1), 79–98 (2009)
4. Yayla, A.: Enforcing information security policies through cultural boundaries: a multinational company approach. In: Proceedings of 2011 ECIS, Paper 243, pp. 1–11 (2011)
5. Stoneburner, G., Goguen, A.Y., Feringa, A.: SP 800-30. Risk management guide for
information technology systems (2002)
6. D’Arcy, J., Greene, G.: Security culture and the employment relationship as drivers of
employees’ security compliance. Inf. Manag. Comput. Secur. 22(5), 474–489 (2014)
7. Herath, T., Rao, H.R.: Encouraging information security behaviors in organizations: role of
penalties, pressures and perceived effectiveness. Decis. Support Syst. 47(2), 154–165 (2009)
8. Ng, B.Y., Xu, Y.: Studying users’ computer security behavior using the health belief model.
In: Proceedings of PACIS 2007, vol. 45, pp. 423–437 (2007)
9. Herath, T., Rao, H.R.: Protection motivation and deterrence: a framework for security policy
compliance in organisations. Eur. J. Inf. Syst. 18(2), 106–125 (2009)
10. Johnston, A.C., Warkentin, M.: Fear appeals and information security behaviors: an
empirical study. MIS Q. 34, 549–566 (2010)
11. Siponen, M., Mahmood, M.A., Pahnila, S.: Technical opinion: are employees putting your company at risk by not following information security policies? Commun. ACM 52(12), 145–147 (2009)
12. Steinbart, P.J., Keith, M.J., Babb, J.: Examining the continuance of secure behavior: a
longitudinal field study of mobile device authentication. Inf. Syst. Res. 27, 219–239 (2016)
13. Vance, A., Siponen, M., Pahnila, S.: Motivating IS security compliance: insights from habit
and protection motivation theory. Inf. Manag. 49(3), 190–198 (2012)
14. Siponen, M., Mahmood, M.A., Pahnila, S.: Employees’ adherence to information security
policies: an exploratory field study. Inf. Manag. 51(2), 217–224 (2014)
15. Ng, B.Y., Kankanhalli, A., Xu, Y.C.: Studying users’ computer security behavior: a health
belief perspective. Decis. Support Syst. 46(4), 815–825 (2009)
16. Fornell, C., Larcker, D.F.: Structural equation models with unobservable variables and
measurement error: algebra and statistics. J. Market. Res. 18, 382–388 (1981)
17. Gefen, D., Straub, D., Boudreau, M.C.: Structural equation modeling and regression:
guidelines for research practice. Commun. Assoc. Inf. Syst. 4(1), 7 (2000)
18. Gefen, D., Straub, D.: A practical guide to factorial validity using PLS-graph: tutorial and
annotated example. Commun. Assoc. Inf. Syst. 16(1), 5 (2005)
19. Chin, W., Marcolin, B.: The holistic approach to construct validation in IS research:
examples of the interplay between theory and measurement. In: Administrative Sciences
Association of Canada Annual Conference, vol. 16, pp. 34–43. Administrative Sciences
Association of Canada (1995)
Lessons Learned from Honeypots - Statistical
Analysis of Logins and Passwords
Pavol Sokol1(B) and Veronika Kopčová2
Faculty of Science, Institute of Computer Science,
Pavol Jozef Safarik University in Kosice, Jesenna 5, 040 01 Kosice, Slovakia
[email protected]
Faculty of Science, Institute of Mathematics,
Pavol Jozef Safarik University in Kosice, Jesenna 5, 040 01 Kosice, Slovakia
[email protected]
Abstract. Honeypots are unconventional tools for studying the methods, tools and goals of attackers. In addition to IP addresses, timestamps and counts of attacks, these tools collect combinations of login and password. Analysis of the data collected by honeypots can therefore offer a different view of logins and passwords. In this paper, advanced statistical methods and correlations with spatial-oriented data are applied to find more detailed information about the logins and passwords. We also use the Chi-square test of independence to study the difference between login and password, and we study the agreement of the structure of password and login using kappa statistics.
Keywords: Honeypot · Chi-square test · Kappa statistic · Spatial data
1 Introduction
In the current information society we deal with increasing security threats, and an important part of information security is therefore the protection of information. The common security tools, methods and techniques used before are ineffective against new security threats, so it is necessary to choose other tools and techniques. It seems that network forensics tools, especially honeypots and honeynets,
are very useful tools. The use of the word “honeypot” is quite recent [1]; however, honeypots have been used in computer systems for more than twenty years. A honeypot can be defined as a computing resource whose value is in being attacked [2]. Lance Spitzner defines a honeypot as an information system resource whose value lies in unauthorized or illicit use of that resource [3].
The most common classification of honeypots is based on the level of interaction, i.e. the range of possibilities the attacker is given after attacking the system. Honeypots can be divided into low-interaction and high-interaction honeypots.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 112–126, 2016.
DOI: 10.1007/978-3-319-49944-4_9
On the one hand, low-interaction honeypots emulate the characteristics of network services or a particular operating system; an example of this type is Dionaea [4]. On the other hand, high-interaction honeypots run a complete operating system with all services to obtain more accurate information about attacks and attackers [5]; an example of this type is HonSSH [6].
The concept of the honeypot is extended by the honeynet, a special kind of high-interaction honeypot. A honeynet can also be referred to as “a virtual environment, consisting of multiple honeypots, designed to deceive an intruder into thinking that he or she has located a network of computing devices of targeting value” [7]. The honeynet architecture has four main parts, namely data control, data capture, data collection and data analysis [2,7].
The main reason to use these tools is the collection and analysis of data captured by honeypots and honeynets. Learning new, unconventional information about the attacks, attackers and tools helps to protect the network services and computer networks of organizations. Each honeypot collects the IP addresses of attackers and further data according to the type of honeypot. In this paper we use the low-interaction honeypot Kippo [8], which collects timestamps, the IP address of the attacker, the type of SSH client and combinations of logins and passwords. For the purpose of this paper we focus on logins, passwords and their combinations.
This paper is a sequel to our analysis of data collected from honeypots and honeynets. In [9], the authors focus on automated secure shell (SSH) bruteforce attacks and discuss the length of passwords, password composition compared to known dictionaries, dictionary sharing, username-password combinations, username analysis and timing analysis. The main aim of this paper, on the other hand, is to shed light on attackers’ behaviour and to provide recommendations for SSH users and administrators. We focus on two main statistical analyses: firstly, the Chi-square test of independence, which analyzes group differences; secondly, kappa statistics, which measure agreement between observers.
To formalize the scope of our work, we state two research questions:
– Which attributes of logins, passwords and their combinations are significant for the security of systems?
– What is the relationship between the logins and passwords and the origin of attacks?
This paper is organized into seven sections. Section 2 reviews published research related to lessons learned from analyses of honeypots and honeynets. Section 3 outlines the dataset and methods used for the experiment. Sections 4, 5 and 6 focus on statistical and spatial analysis of logins, passwords and their combinations. The last section contains conclusions, discussion and our suggestions for future research.
2 Related Works
As mentioned before, the main task of honeypots and honeynets is to analyse the captured data and search for new knowledge about the attacks and attackers. This section provides an overview of papers that focus on lessons learned from honeypot and honeynet data.
Analyses of data collected by high-interaction honeypots are discussed by Nicomette et al. [10] and Alata et al. [11]. Nicomette et al. [10] concentrate on the attacks executed against the SSH service and the activities executed after attackers gain access to the honeypot. Attackers and their activities after logging in are discussed in [11]; the authors correlated their findings with results from distributed low-interaction honeypots.
Low-interaction honeypots are discussed by Sochor and Zuzcak in papers [12,13]. In [12], the data show currently spreading threats caught by honeypots, and a thorough interpretation of the lessons learned from using the honeypots is outlined. The principal results are shown in [13]; in addition, the authors underline the fact that differentiating honeypots according to their IP address is quite rough (e.g. differentiation between academic and commercial networks).
SGNET was used by [14] as a distributed system of honeypots. The authors question how representative the collected malware sample datasets are; they claim that false negative alerts differ from what would be expected, and that false positive alerts occur in unexpected places. Clustering attack patterns with a suitable similarity measure is discussed in [15]. The results of this study allow identification of the activities of several worms and botnets in the collected traffic.
Time-oriented data were of interest in [16], which outlines the visualization of such data from honeypots and honeynets. In addition, the authors provide results based on heatmaps, a special kind of visualisation. It was shown that time is an important aspect of attacks: attackers are mainly active at night (according to the honeynet's time-zone analysis).
Another example of using low-interaction honeypots (Dionaea) for study is given in [17]. It presents the results of nearly two years of operation of honeypot systems installed on an unprotected research network. The paper focuses on information about the lifetime of malware programs and long-term malware activity.
3 Data Collection and Analysis Methodology
The data were collected from a honeynet located in a campus network. The honeynet, which runs on port 22, consists of SSH honeypots Kippo [8] in low-interaction mode. In this mode the honeypots do not allow attackers to log into a shell; they only capture data about network flows entering the honeynet. The honeypots collected authentication attempts from 3 August 2014 to 24 December 2015. During this period 1 391 746 records were collected. Each record contains the username and password used in an attempt, as well as the IP address and client version of the attacker and the beginning and end of the session. The dataset contains 5 488 unique logins, 205 477 unique passwords and 212 687 unique combinations of login and password.
For spatial analysis, each record was complemented with spatial data using the IP-API.com service [18]. This service provides free use of its Geo IP API through multiple response formats. Each record was supplemented with the time zone, country, region, city, Internet service provider (ISP) and global positioning system (GPS) coordinates.
Data cleaning and analysis were performed using the HoneyLog framework [19]. This framework for analysing honeypot and honeynet data is based on the FuelPHP PHP framework and JavaScript libraries. It has two main segments: a client part and a server part.
For the purpose of this paper, an important part of the dataset consists of the combinations of logins and passwords. Since logins and passwords are qualitative data, they needed to be converted into quantitative data. To each login and password we assigned the following attributes:
– contains only lowercases - the login or password contains only lowercase characters (ASCII codes between 97 and 122);
– contains only uppercases - the login or password contains only capital characters (ASCII codes between 65 and 90);
– contains only numbers - the login or password contains only numbers (ASCII codes between 48 and 57);
– contains number - the login or password contains at least one number;
– contains year - the login or password contains a year (2014 or 2015); and
– contains special character - the login or password contains at least one special character (ASCII codes 32-47, 58-64, 91-96 and 123-127).
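As a minimal sketch, the attribute assignment above can be implemented as follows; the function and attribute names are our own, not part of the HoneyLog framework.

```python
# Assigns the paper's boolean attributes to a login or password string.
# ASCII ranges follow the list above; names are illustrative only.
def credential_attributes(s: str) -> dict:
    codes = [ord(c) for c in s]
    # Special characters: ASCII 32-47, 58-64, 91-96 and 123-127.
    special = set(range(32, 48)) | set(range(58, 65)) | set(range(91, 97)) | set(range(123, 128))
    return {
        "only_lowercase": bool(s) and all(97 <= c <= 122 for c in codes),
        "only_uppercase": bool(s) and all(65 <= c <= 90 for c in codes),
        "only_numbers": bool(s) and all(48 <= c <= 57 for c in codes),
        "contains_number": any(48 <= c <= 57 for c in codes),
        "contains_year": "2014" in s or "2015" in s,
        "contains_special": any(c in special for c in codes),
    }

print(credential_attributes("root2015"))
```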
In this paper we use two statistical methods: the Chi-square test of independence and kappa statistics. The Chi-square test of independence, also known as the Pearson Chi-square test [20], is one of the most useful tools for testing hypotheses when the variables are nominal. It is a non-parametric tool designed to analyse group differences. Each non-parametric test has its own specific assumptions as well. The assumptions of the Chi-square test include:
1. The data in the cells should be frequencies, or counts of cases.
2. The categories of the variables are mutually exclusive.
3. Each subject may contribute data to one and only one cell in the Chi-square.
4. The study groups must be independent.
5. While the Chi-square test has no rule limiting the number of cells (by limiting the number of categories for each variable), a very large number of cells (over 20) can make it difficult to meet assumption 6 below and to interpret the meaning of the results.
6. The expected cell value should be 5 or more in at least 80% of the cells, and no cell should have an expected value of less than one. This assumption is most likely to be met if the sample size equals at least the number of cells multiplied by 5.
On the other hand, Kappa [21] is intended to give the reader a quantitative measure of the magnitude of agreement between observers. Interobserver
variation can be measured in any situation in which two or more independent
observers are evaluating the same thing.
4 Logins
The first observed aspect of the analysis is the login. The top 10 logins are shown in Fig. 1 (left). This diagram shows that the most tested login is root. Among the other logins, attackers test default logins for different systems (admin, user, pi, oracle, etc.). Attackers also often try the same login and password combination. In this paper we focus on the analysis of the logins with the largest number of unique passwords. The top 10 logins with unique passwords are shown in Fig. 1 (right). From this perspective, the most tested login is again root. Attackers also test the following logins with a large number of unique passwords: user, test, nagios, etc.
Fig. 1. Top 10 logins and top 10 logins with unique passwords
Attributes of Logins
According to the Linux documentation for the tool useradd [22], a Unix/Linux username (login) must match the regular expression ^[a-z_][a-z0-9_-]*[$]?$. This expression means that the first character of a login is a lowercase letter or an underscore, and the other characters are lowercase letters, numbers, underscores or dashes, optionally followed by a trailing $. Capital letters are not allowed. Moreover, logins must neither start with a dash nor contain a colon or whitespace (space, end of line, tabulation, etc.). The documentation notes that using a slash may break the default algorithm for the definition of the user's home directory.
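The useradd pattern quoted above can be checked directly; this is a sketch using Python's standard regular-expression module, with sample logins of our own choosing.

```python
# Validating candidate logins against the useradd username pattern.
import re

USERADD_RE = re.compile(r"^[a-z_][a-z0-9_-]*[$]?$")

# Sample logins: the first two are valid; "Admin" has a capital letter
# and "user/1" contains a slash, so both fail the pattern.
for login in ["root", "nagios", "Admin", "user/1"]:
    print(login, bool(USERADD_RE.match(login)))
```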
As we can see in Fig. 2, the largest group of logins consists of logins containing only lowercase characters (88,47%). A small share of logins contains a number (7,89%) or a special character (4,46%). In our opinion, logins that contain capital letters or special characters are tested by a special group of attackers (script kiddies), or the attacks were directed at systems other than Unix/Linux.
Another studied aspect is the length of logins (Fig. 3). According to the Linux documentation mentioned above [22], logins may only be up to 32 characters long. The length of the tested logins ranges from 1 to 50 characters. Logins with a length between 33 and 50 characters are a sign of incorrect use of automated programs.
Fig. 2. Attributes of logins
An example is root$1$a1O0GlNs$KPwONdPK6G5KqjsVNNOyb. The most common login length is six characters, and most logins have between 3 and 14 characters.
Fig. 3. Length of logins
Frequency of ASCII Characters in Logins
For purpose of the frequency of ASCII characters in logins we created frequency
table (Fig. 4). This table takes into account the frequency of at least one occurrence of a given character within a login. ASCII character with the highest
occurrence is lowercase a. Lowercase e, which is the most frequent character
in many alphabets (e.g. English, French and German alphabet), is in the 2nd
place. On the other hand, lowercase q and x have the lowest occurrence. The
most used number is 1 and 2. On the other hand, 6 and 8 are used at least. In
the most cases the login contain special character /. In contrast to this, passwords do not contain this character. According to our opinion, it is again sign
of incorrect use of automated programs.
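The "at least one occurrence" counting used for the frequency table can be sketched as below; the helper name and the sample logins are ours, not from the dataset.

```python
# Builds a character frequency table where each login contributes at most
# one count per distinct character, as described in the text above.
from collections import Counter

def char_frequency(logins):
    freq = Counter()
    for login in logins:
        freq.update(set(login))  # count each character once per login
    return freq

# Illustrative sample of frequently tested logins.
freq = char_frequency(["root", "admin", "user", "test", "nagios"])
print(freq.most_common(3))
```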
Fig. 4. Frequency table of ASCII characters in logins
Logins and Origin of Attacks
Table 1 shows the top 20 countries from which attacks originate. For each country, the table shows the count of attacks, the top login with its count and percentage, and the 2nd and 3rd most tested logins. The login root is the most tested login from every top 20 country. An interesting fact is that the percentage of the tested login root relative to all attempts from a country differs: on one hand, there is a high percentage in countries such as China, Hong Kong, France and Hungary; on the other hand, there is a low percentage in countries such as Argentina or Singapore. The most tested groups of logins are root/admin/ubnt, root/admin/test and root/admin/user. Based on this, it can be concluded that groups of tested logins, considering the origin of attacks, can be an interesting indicator for finding groups of attackers.
5 Passwords
The second observed aspect is the password. Compared to logins, the variety of passwords is more pronounced. The most commonly used password is admin. The top 10 most used passwords (123456, password, root, 1234, etc.) are shown in Fig. 5 (left). As with logins, we also focus on the passwords that are used with the most unique logins. In this regard, the most used password is (none). The other passwords used with the most unique logins are shown in Fig. 5 (right).
Attributes of Passwords
In this section we focus on the attributes of passwords, shown in Fig. 6. In contrast to logins, the Linux documentation does not restrict the characters of a password, since the system stores only a hash of the password rather than the clear password. According to Fig. 6, the most frequently used passwords contain numbers (50,36%). A slightly smaller
Table 1. Logins and top 20 countries
Country         Count of attacks   Top login   Count (percent) of top login   The 2nd and 3rd top logins
                895 945            root        873 321 (97,47%)
Hong Kong       219 621            root        219 025 (99,73%)
                123 430            root        122 889 (99,56%)
United States   92 721             root        81 381 (87,77%)
                6 952              root        6 820 (98,10%)
Rep. of Korea   5 459              root        4 074 (74,63%)
                2 872              root        804 (27,99%)
                2 851              root        1 848 (64,82%)
                2 609              root        868 (33,27%)
                2 131              root        87 (4,08%)
                2 113              root        188 (8,90%)
                2 021              root        1 095 (54,18%)
                1 536              root        472 (30,73%)
                1 358              root        641 (47,20%)
                1 343              root        437 (32,54%)
                1 276              root        597 (46,79%)
                1 142              root        467 (40,89%)
                1 127              root        697 (61,85%)
                1 124              root        642 (57,12%)
                                   root        712 (75,83%)
number of passwords contains only lowercase characters (45,24%). In contrast, passwords containing only numbers occur almost three times less often. An interesting fact is that among the top 10 passwords there are four passwords containing only numbers (123, 1234, 12345, 123456) (9,9%) and only one password containing only lowercase characters (test) (0,83%).
Another attribute of a password is its length. The length of the passwords ranges between 0 and 98 characters; most passwords contain 8 characters. The lengths of most passwords lie in the range between 3 and 20 characters. It is worth mentioning that passwords with 32 characters are hashes (e.g. 706e642a056c7e894ed5a01e55700004). The number of characters of passwords is shown in Fig. 7 (left). Passwords with 33 characters and more are a sign of incorrect use of a tool (e.g. #files th a:hover {background:transparent; border...) or of a manual attack by script kiddies (e.g. rooooooooooooooooooooooooooooooooooooooooooooooooooooooooooot).
We also focus on the largest group of passwords, those containing only numbers. In this group, the largest subgroups of passwords contain 8 and 6 digits, respectively. The numbers of lengths of passwords containing only numbers are shown in Fig. 7 (right).
Fig. 5. Top 10 passwords and top 10 passwords with unique logins
Fig. 6. Attributes of passwords
Fig. 7. Length of passwords
Frequency of ASCII Characters in Passwords
As for logins, a frequency table of ASCII characters in passwords was created (Fig. 8). This table takes into account the frequency of at least one occurrence of a given character within a password. The ASCII character with the highest occurrence is the lowercase a. The lowercase e, which is the most frequent character in many alphabets (e.g. the English, French and German alphabets), is in 2nd place. On the other hand, the capital V and capital K have the lowest occurrence.
Similarly to logins, the most used numbers are 1 and 2; the numbers 6 and 7 are used the least. The special characters that occur most often in passwords are @ and !. An interesting fact is the occurrence of the characters Horizontal Tab (ASCII code 9) and Device Control 1-4 (ASCII codes 17-20) in passwords (e.g. %username DC1 [email protected], %username DC2 34567890-=). These codes are used for software flow control (e.g. DC1 to quit an application) and are not visible in logs. Passwords with these codes begin with the special characters !, % or @ and are linked to the login root. In our opinion, passwords with these codes stem from incorrect use of a tool by script kiddies.
Fig. 8. Frequency table of ASCII characters in passwords
Passwords and Origin of Attacks
Table 2 shows the top 20 countries where attacks originated. For each country, the table shows the count of attacks, the most used password with its count and percentage, and the 2nd and 3rd most used passwords. In the table, (none) means that a password without characters was entered. The password 123456 is the most tested password in 7 of the top countries. An interesting finding is the password weubao in Hong Kong. While for logins the most tested groups were similar across the origins of attacks, for passwords there are no similar groups among the top 3 passwords. Based on this, it can be concluded that there is a relationship between passwords and the origin of attacks.
6 Combination of Logins and Passwords
In the previous sections we focused on logins and passwords separately. Since attackers test combinations of login and password, we now focus on this aspect. The most tested combinations of login and password used by attackers are the following: root/admin, root/root, root/Password, root/123456, root/toor, root/1234, root/1, etc. In the following sections we focus on the relationship between logins and passwords.
Table 2. Passwords and top 20 countries
Country         Count of attacks   Top password   Count (percent) of top password   The 2nd and 3rd passwords
                895 945                           6 384 (0,71%)
Hong Kong       219 621                           188 (0,09%)
                123 430                           112 (0,09%)
United States   92 721                            895 (0,97%)
                6 952                             14 (0,20%)
Rep. of Korea   5 459                             90 (1,65%)
                2 872                             164 (5,71%)
                2 851                             58 (2,03%)
                2 609                             48 (1,84%)
                2 131                             44 (2,06%)
                2 113                             23 (1,09%)
                2 021                             181 (8,96%)
                1 536                             114 (7,42%)
                1 358                             28 (2,06%)
                1 343                             144 (10,72%)
                1 276                             43 (3,37%)                        [email protected]*i%n$t#o!(s
                1 142                             8 (0,70%)
                1 127                             36 (3,19%)
                1 124                             80 (7,12%)
                                                  23 (2,45%)
Association Between Passwords and Logins and Their Attributes
To analyse the association between passwords and logins and their attributes, the Chi-square test of independence [20] is used. In our case study, there are two groups: passwords and logins. The independent variable is login/password and the dependent variable is the attribute: special char, only number, number, only uppercase. Our goal is to find out whether logins and passwords differ. Table 3 shows our data with the calculated marginals.
The formula for calculating the Chi-square statistic is χ2 = Σ (O − E)2 / E, where O is the observed and E the expected value. The Chi-square expected values are calculated as E = Mr · Mc / n. Table 4 provides the results of this calculation for each cell in the form: expected value (cell chi-square value).
Now we sum the cell chi-square values to obtain the chi-square statistic for the table; in this case it is 3571. The chi-square table requires the degrees of freedom to determine the significance level of the statistic. It holds: df =
Table 3. Calculation of marginals
              Special char   Only number   Number    Only uppercase   Marginals Mr
Password      41 623         177 543       442 514   3 862            665 542
Login         989            226           1 933     50               3 198
Marginals Mc  42 612         177 769       444 447   3 912            668 740
Table 4. Cell expected values and (cell Chi-square values)
          Special char        Only number        Number             Only uppercase
Password  42408,22 (14,54)    176918,89 (2,20)   442321,60 (0,08)   3893,29 (0,25)
Login     203,78 (3025,76)    850,11 (458,20)    2125,40 (17,42)    18,71 (52,34)
(number of rows − 1) · (number of columns − 1) = 1 · 3 = 3. The critical value for the chi-square distribution with df = 3 is 7,815. Our calculated value is larger than the critical value (3571 > 7,815), so the null hypothesis is rejected, which means that there is a relationship between login and password.
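The whole calculation above can be reproduced in a few lines; this sketch takes the observed counts from Table 3 (the Login row follows from subtracting the Password row from the column marginals Mc).

```python
# Recomputing the chi-square test of independence from the observed counts.
observed = {
    "Password": [41_623, 177_543, 442_514, 3_862],
    "Login": [989, 226, 1_933, 50],
}
columns = ["special char", "only number", "number", "only uppercase"]

row_totals = {row: sum(counts) for row, counts in observed.items()}  # Mr
col_totals = [sum(observed[row][j] for row in observed)
              for j in range(len(columns))]                          # Mc
n = sum(row_totals.values())

chi2 = 0.0
for row, counts in observed.items():
    for j, O in enumerate(counts):
        E = row_totals[row] * col_totals[j] / n  # E = Mr * Mc / n
        chi2 += (O - E) ** 2 / E

df = (len(observed) - 1) * (len(columns) - 1)
print(round(chi2), df)  # 3571 3
```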
However, this result does not specify what drives the relationship; this can be seen in Table 4. The largest cell chi-square value occurs for special characters in logins, which means that the number of logins containing a special character is significantly greater than the expected value. On the other hand, a cell chi-square value of less than 1 means that the number of observed cases is close to the number of expected cases; thus there is no effect for passwords with the attributes number and only uppercase.
Table 5. Examples of logins and passwords in Chi-square test of independence
          Special char   Only number   Number        Only uppercase
Password  [email protected]#        30011970      Aa12345root   NASA
Login     root!¨?$%&     12345678
Based on the above, it can be concluded that there is a relationship between login and password, especially if the password contains a special character or a number. Logins typically contain only lowercase characters; therefore, if a login contains special characters, numbers, or only capital characters, there is a relationship between the login and the password. This occurs to the greatest extent for logins with a special character (e.g. the password [email protected]# for the login root). Another example is the login root!”?$%& with the password (none) (other examples in Table 5). In these cases, it can be concluded that the attack is not a dictionary or brute-force attack, but a manual attack or an automated attack by script kiddies.
P. Sokol and V. Kopčová
Agreement of Structure of Password and Login
To study the agreement of the structure of password and login, we use the kappa statistic. The data are collected in Table 6.
Table 6. Kappa statistics
Login/Password Special char Only number Number Only uppercase Total
special char
only number
only uppercase
We can simply calculate the percentage of agreement as the sum of the diagonal cells divided by the number of observations, which gives 90,3% agreement. However, that measure does not take into account the random chance of agreement. We therefore calculate the expected agreement, Pe = 0,416. The formula for kappa is K = (Po − Pe)/(1 − Pe) = 0,834. Using the table in [21], we can conclude that the agreement of login and password is substantial (Table 7).
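The kappa computation itself is short once the observed and expected agreement are known. A small sketch using the values reported above; the helper shows how Po and Pe would be derived from the raw counts of a cross-table such as Table 6 (whose counts are not repeated here):

```python
def agreement(table):
    """Observed agreement Po (diagonal share) and expected chance
    agreement Pe (sum of products of row and column marginal shares)."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n
    row_m = [sum(row) / n for row in table]
    col_m = [sum(col) / n for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_m, col_m))
    return p_o, p_e

def kappa(p_o, p_e):
    """Cohen's kappa: chance-corrected agreement K = (Po - Pe) / (1 - Pe)."""
    return (p_o - p_e) / (1 - p_e)

# Values reported in the text above:
p_o, p_e = 0.903, 0.416
print(round(kappa(p_o, p_e), 3))  # 0.834
```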
Table 7. Examples of logins and passwords in Kappa statistics
Login/Password Special char
Only number
Only uppercase
special char
root/, . kl;iop890 root/ − ∗ 123456 root-*123456
only number
123456 123456
rootzo9 ∗?qp
tom6bj 278497
r00t loler11q
only uppercase
SZIM 888888
Conclusions, Recommendations and Future Works
Attacks collected by honeypots are an interesting source for further analysis. In this paper we focused on logins, passwords and their combination, and outlined a statistical analysis of the collected data. General rules for password creation state that a password should contain lowercase letters, capital letters, numbers and special characters, and that its length should be 8 or more. Based on the above, we propose using the capital letters V and K and the numbers 6 and 7 in passwords. We recommend avoiding the following lowercase letters: a, e, i, n, r, o, s, and the following numbers: 1, 2, 3 and 9. To strengthen a password, it is recommended to use a length of 10 or more and the special characters [, ], { and }.
Since the combination of login and password is used in attacks, the strength of the login also needs to be considered. General safety rules state that default passwords and the login root should not be used. We agree with these rules, but in addition we propose the following rules for login creation. The first character of a login must be a lowercase letter; q or x look like the best choice. The login must have a length between 1 and 32 characters; we recommend using a login with a length between 12 and 32 characters. We recommend avoiding the following lowercase letters: a, e, i, r, n, o, s, t, l, c, and the following numbers: 1, 2, 3 and 0. In general, using numbers increases the security of the password, especially the numbers 6, 7 and 8.
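The recommendations above can be expressed as a small validator. This is an illustrative sketch only: the function names and the allowed login character set (following the useradd convention [22]) are our own assumptions, not part of the paper.

```python
import re

# Characters the analysis above recommends avoiding.
AVOID_IN_PASSWORD = set("aeinros") | set("1239")
AVOID_IN_LOGIN = set("aeirnostlc") | set("1230")

def follows_password_rules(password: str) -> bool:
    """Length 10+, none of the frequently attacked characters,
    and at least one of the recommended specials [, ], { or }."""
    return (
        len(password) >= 10
        and not (set(password.lower()) & AVOID_IN_PASSWORD)
        and any(c in "[]{}" for c in password)
    )

def follows_login_rules(login: str) -> bool:
    """Starts with a lowercase letter, is 12-32 characters long,
    and avoids the most commonly attacked characters."""
    return (
        re.fullmatch(r"[a-z][a-z0-9_-]{11,31}", login) is not None
        and not (set(login) & AVOID_IN_LOGIN)
    )

print(follows_password_rules("qV6K7[pwd]x"))  # True
print(follows_password_rules("password123"))  # False
print(follows_login_rules("qx_hub_66788"))    # True
```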
As shown before, the Chi-square test of independence and the kappa statistic show that there is a relationship between logins and passwords. On the basis of these tests, attacks can be divided into manual attacks and automated attacks. In the future, the research in the field of analysis of the collected data will continue. We will primarily focus on types of clients and on time-oriented analysis from the perspective of logins and passwords.
Acknowledgments. We would like to thank colleagues from the Czech chapter of
The Honeynet Project for their comments and valuable input. This paper is funded
by the Slovak Grant Agency for Science (VEGA) grants under contract No. 1/0142/15
and No. 1/0344/14, VVGS projects under contract No. VVGS-PF-2016-72610 and
No. VVGS-PF-2016-72616 and Slovak APVV project under contract No. APVV-140598.
1. Pouget, F., Dacier, M., et al.: Honeypot-based forensics. In: AusCERT Asia Pacific
Information Technology Security Conference (2004)
2. Spitzner, L.: The honeynet project: trapping the hackers. IEEE Secur. Priv. 1(2),
15–23 (2003)
3. Spitzner, L.: Honeypots: Tracking Hackers. Addison-Wesley, Reading (2003)
4. Dionaea project. https://github.com/rep/dionaea. Accessed 20 Aug 2016
5. Joshi, R., Sardana, A.: Honeypots: A New Paradigm to Information Security. CRC
Press, Boca Raton (2011)
6. HonSSH project. https://github.com/tnich/honssh/wiki. Accessed 20 Aug 2016
7. Abbasi, F.H., Harris, R.: Experiences with a generation III virtual honeynet. In:
Telecommunication Networks and Applications Conference (ATNAC) 2009, pp.
1–6. IEEE, Australasian (2009)
8. Kippo project. https://github.com/desaster/kippo. Accessed 27 Aug 2016
9. Abdou, A.R., Barrera, D., Oorschot, P.C.: What lies Beneath? Analyzing automated SSH bruteforce attacks. In: Stajano, F., Mjølsnes, S.F., Jenkinson, G.,
Thorsheim, P. (eds.) PASSWORDS 2015. LNCS, vol. 9551, pp. 72–91. Springer,
Heidelberg (2016). doi:10.1007/978-3-319-29938-9 6
10. Nicomette, V., Kaâniche, M., Alata, E., Herrb, M.: Set-up and deployment of a
high-interaction honeypot: experiment and lessons learned. J. Comput. Virol. 7(2),
143–157 (2011)
11. Alata, E., Nicomette, V., Dacier, M., Herrb, M., et al.: Lessons learned from the
deployment of a high-interaction honeypot. arXiv preprint arXiv:0704.0858 (2007)
12. Sochor, T., Zuzcak, M.: Study of internet threats and attack methods using honeypots and honeynets. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) CN 2014. CCIS, vol.
431, pp. 118–127. Springer, Heidelberg (2014). doi:10.1007/978-3-319-07941-7 12
13. Sochor, T., Zuzcak, M.: Attractiveness study of honeypots and honeynets in internet threat detection. In: Gaj, P., Kwiecień, A., Stera, P. (eds.) CN 2015. CCIS,
vol. 522, pp. 69–81. Springer, Heidelberg (2015). doi:10.1007/978-3-319-19419-6 7
14. Canto, J., Dacier, M., Kirda, E., Leita, C.: Large scale malware collection: lessons
learned. In: IEEE SRDS Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems. Citeseer (2008)
15. Thonnard, O., Dacier, M.: A framework for attack patterns’ discovery in honeynet
data. Digital Invest. 5, 128–139 (2008)
16. Sokol, P., Kleinová, L., Husák, M.: Study of attack using honeypots and honeynets: lessons learned from time-oriented visualization. In: EUROCON 2015 - International Conference on Computer as a Tool (EUROCON), pp. 1–6. IEEE (2015)
17. Skrzewski, M.: Network malware activity – a view from honeypot systems. In:
Kwiecień, A., Gaj, P., Stera, P. (eds.) CN 2012. CCIS, vol. 291, pp. 198–206.
Springer, Heidelberg (2012). doi:10.1007/978-3-642-31217-5 22
18. IP-API.com service. http://ip-api.com. Accessed 20 Aug 2016
19. Sokol, P., Pekarcik, P., Bajtos, T.: Data collection and data analysis in honeypots
and honeynets. In: Proceedings of the Security and Protection of Information.
University of Defence (2015)
20. McHugh, M.L.: The chi-square test of independence. Biochemia Medica 23(2),
143–149 (2013)
21. Viera, A.J., Garrett, J.M., et al.: Understanding interobserver agreement: the
kappa statistic. Fam. Med. 37(5), 360–363 (2005)
22. Linux documentation for useradd. http://www.unix.com/man-page/all/0/useradd. Accessed 22 Aug 2016
Towards a General Information Security
Management Assessment Framework
to Compare Cyber-Security of Critical
Infrastructure Organizations
Edward W.N. Bernroider, Sebastian Margiol, and Alfred Taudes
Institute for Information Management and Control,
Vienna University of Economics and Business,
Welthandelsplatz 1, 1020 Vienna, Austria
Abstract. This paper describes the development of an information security
framework that aims to comparatively assess the quality of management processes in the context of cyber-security of organizations operating within critical
infrastructure sectors. A design science approach was applied to establish a
framework artifact that consists of the four dimensions “Security Ambition”,
“Security Process”, “Resilience” and “Business Value”. These dimensions were
related to the balanced scorecard concept and information security literature.
The framework includes metrics, measurement approaches and aggregation
methods. In its adapted form, our framework enables a systematic compilation of information security, displays the security situation of a focal firm against desired future states and industry benchmarks, and allows for an investigation of interdependencies. The design science research process included
workshops, cyclic refinements of the instrument, pretests and the framework
evaluation within 30 critical infrastructure organizations. The framework was
found to be particularly useful as a learning and benchmarking tool capable of highlighting weaknesses, strengths, and gaps in relation to standards.
Keywords: BSC · Cyber-security · Information security management · Critical infrastructure · Design science

1 Introduction
Today’s organizations in the private and public sectors have become increasingly
dependent on Information and Communication Technologies (ICTs) to develop and
offer their services and products. While these ICTs offer considerable advantages, their wide-spread use exposes individuals, organizations and nations to risks, which in particular include Internet-related security breaches [1]. A lack of understanding of the risk cultures and exposures related to developing and operating ICT can lead to significant negative impacts. Consequently, there is a natural interest of a wide range of stakeholders, including citizens and governments [2], to ensure that any organization in
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 127–141, 2016.
DOI: 10.1007/978-3-319-49944-4_10
an economy, in particular those operating critical infrastructures, manage their ICT risks appropriately. An infrastructure is considered critical when its maintenance is essential for vital societal functions. Damage to a critical infrastructure, such as energy supply or transportation [3], may have a significant negative impact on the security of the country and the well-being of its citizens.
An important and growing area of research and standard development deals with
organizational-related cyber-security issues. Cyber-security has been defined by the
International Telecommunication Union (ITU) to mean “a collection of tools, policies,
security concepts, security safeguards, guidelines, risk management approaches,
actions, training, best practices, assurance and technologies that can be used to protect
the cyber environment, and organization and user’s assets” [4]. The term can be
considered to holistically cover all ICT related threats, e.g., involve corrupting or
disrupting e-services or simply the information flow between people and organizations.
Organizational and user assets include all resources (information, people, technologies
and systems).
When characterizing cyber-security studies, one needs to understand what is
measured, why and for whom the measurement takes place [5]. In our case, we seek to
measure critical infrastructure organizations, which are of particular importance to the
stability of a nation or economy [2]. The motivation is to allow for a description and
comparison of cyber-security in terms of overall performance of information security
management (ISM) and its strategic security programs, which are the foundation for
achieving security on the technical infrastructure level. While technical scores are
important for understanding the many facets of operational security, including the configuration of firewalls, the effectiveness of intrusion detection systems [6] or the use of proxy servers, strategic scores allow for an understanding of how well risks are being predicted or managed, whether policy compliance is reached, or whether business impact analysis is sufficient. Finally, we seek to feed back information to the critical infrastructure organizations and to serve external authorities and policy makers, who can support cyber-security with a range of tools, e.g., research and development.
An important element of cyber-security is to protect an entity, based on given security objectives, against risks in the cyber-environment. Multi-dimensional framework approaches can be used to assess and control levels of performance [7], in particular in relation to cyber-security [8]. The strategic and measurement focus of this study made us consider the balanced scorecard (BSC) approach, which offers a systematic analysis linking strategy with a set of measures [9–11]. Such scorecards typically display the current security situation against desired future states in an attempt to systematically manage the cyber-security of an organization, which is also the unit of analysis for this study.
The aim of this paper is to build and test a framework that can be used to holistically measure the quality of information security management in the context of cyber-security and to allow for comparative assessments of organizations in critical infrastructure sectors. This context requires covering the management of ICT security with a focus on the strategic level, holistically spanning an entire organization. We were also interested in the typical frameworks and their metrics, measurement approaches and aggregation methods needed to operate a scorecard.
This scorecard should meet the requirements for evaluating and comparing a given security level, and had to be implementable through face-to-face interviews conducted as part of a survey. The following section gives a preliminary theoretical
overview based on academic literature and standards to identify relevant prior research
focused on scorecards and their constituent elements. A design science approach is
used to develop an assessment framework as an artifact based on scorecard theory and
fieldwork including workshops and empirical validations.
2 Theoretical Background
Balanced Scorecards (BSCs)
As the aim of this paper is to create an information security framework based on the
Balanced Scorecard (BSC), we build on prior research on the BSC, which is considered
a well-established management scorecard in business practice. A study on the application and popularity of different management instruments in different economic regions ranked the BSC as the fifth most popular management instrument worldwide. Within the EMEA economic region (Europe, Middle East and Africa), the BSC was identified as the most popular management instrument [12]. The usefulness of the BSC method was demonstrated for a variety of entities and different organizational settings including, e.g., higher education [13], the national health service [14], banking [15] and the public sector [16]. Another study conducted systematic comparative research to relate the value of the BSC, and the value of specific elements of it, to the context of the application [17]. The BSC appeared to be of value to all ten organizations analyzed (although to varying degrees).
The Balanced Scorecard (BSC) is a measurement system of an entity's performance, linking strategic objectives and measures across four different perspectives and
promising strategy mapping between each of these perspectives [9, 10]. The original
standard dimensions describe how the Learning and Growth dimension (often also
referred to as future perspective) feeds into the Internal Business Process and Customer
dimensions, which impact the Financial dimension. The idea of a BSC is to find a set
of measures that maintain a balance between different dimensions or characteristics,
such as financial and non-financial aspects, short- and long-term objectives, lagging
and leading indicators, and internal and external performance perspectives [11, 14].
These measures should give an overview not only of the current or actual state of an entity's situation, but also of the desired or planned state in terms of specific targets. The gap between the actual and the planned state is the performance gap, which should be closed within a certain time by an action plan that is structurally linked with the corresponding measures in the scorecard.
However, there are also critical reflections of the BSC [18, 19]. Some scientists
question the causality between different perspectives and their suitability to monitor
technological developments [19]. In the IT area the BSC was applied, for example, to strategic information systems [20], in the ERP [21] and IT governance [22] contexts, and to e-business development [23]. Linked or cascading balanced scorecards were, for example, used in IT performance management with a developmental and an operational
BSC [22]. In the context of organizational security, the BSC has been transformed into
an IT security scorecard [8, 24]. While some guidelines for the general transformation
of the BSC into the IT security context were given, specific guidelines on how to
design and implement metrics for different levels and characteristics of measurements
are generally missing.
Due to its high diffusion rate in practice, the BSC seems to be an appropriate
management instrument to capture and compare information security if it is adapted in
structure and content to the specific requirements.
Security Scorecards and Recommendations
For the composition of the information security framework, scorecard-based approaches and general recommendations about information security management (ISM) were taken into consideration. None of the identified sources fulfilled our requirements directly; however, they supported the choice of a BSC as the basis model for our framework scorecard approach and motivated some of its subdomains.
The study of de Oliveira Alves et al. [25] considers the governance of information
security and therefore puts its primary focus on the strategic orientation. Their model is
also based on the basic BSC and proposes indicators which are derived from best
practice and which are also part of the CobiT and ISO/IEC 17799 frameworks.
The BSC concept also constitutes the foundation of the ISScoreCard of Huang, Lee and Kao [24], which covers information security management in the manufacturing industry. Measures for the enhancement of security awareness and resilience against attacks are assigned to its potential perspective. The BSC model for information
security by Herath et al. [8] seeks to implement performance management in relation to
IT security. The BSC study of Royer and Meints [26] covers identity management and
and is inspired by the IT Infrastructure Library (ITIL) [27], CobiT [28] and other best
practice sources. The “Cybersecurity Health Check” system [29] is also based on a
BSC concept. Its aim is to determine information security by regarding the human
factor, especially measures for awareness building and training of information security.
This focus is anchored on the assumption that the human factor poses the weakest link
of a cybersecurity chain [29]. Another BSC-based study applies selected metrics with regard to ITIL, ISO/IEC 27002 and CobiT [30] for assessing the quality of information security for web services.
Two internationally accepted frameworks in the domain of information security
(NIST 800-55 and ISO 27001) have high relevance for the information security
domain. NIST disseminated various publications that are dedicated to information
security on different levels [31]. The NIST publication 800-55, for example, addresses
the performance evaluation of information security and proposes a three-step ascertainment of a target level [32]. The ISO published a series of standards on the topic
information security under the family name ISO 27000. Whilst ISO 27000 offers a general overview of management systems for information security, ISO 27001 is dedicated to the requirements for the establishment, operation, surveillance, maintenance and continuous improvement of an information security management system (ISMS). ISO 27002 builds on ISO 27001 and deals with essential activities for creating and implementing a working ISMS.
3 Research Approach and Process
This study presents a designed, developed, and field-tested assessment framework that
followed the Design Science Research (DSR) approach by Peffers et al. [33].
Accordingly, the research process was structured by the process steps given in Table 1.
The three DSR cycles presented by Hevner [34] have been used to support different
steps in the research process.
Table 1. Research approach and process based on design science [33]

Research process step | Participants | Design cycles | Workshops/events
(1) Identify the problem and motivate | With domain experts | Relevance and rigor cycles | Workshop 1
(2) Define objectives of a solution | With domain experts | Relevance cycle | Workshop 2
(3) Design & Development | With application domain experts (security experts) | Internal and external cycles; external cycles with application domain experts | Workshops 3–4
(4) Demonstration | With domain experts | Design cycle | Pretests
(5) Evaluation | With Chief Information Security Officers (CISOs) | Minor design cycle | Interviews
(6) Communication | With policy makers and domain experts | | Presentation event
For the first stage, “(1) Identify the problem and motivate”, we met with industry consultants and security experts from the application domain to clarify the problem and motivation for this research (workshop 1), followed by a consultation of the academic and practitioner literature. This led to the consideration and adaptation of the BSC concept to the security context. Thus, we engaged in relevance and rigor cycles, respectively. For the second stage, “(2) Define objectives of a solution”, we engaged again with representatives from the application domain (workshop 2) in another relevance cycle. Next, we engaged in “(3) Design & Development” and revised our framework through internal and external design cycles. The third activity covers the design and development of an artifact, including a description of the artifact's desired functionality [33]. Again, the external design cycles were supported by workshops (3–4) together with industry consultants and security experts. The stage “(4) Demonstration” was implemented by pre-testing the framework and meeting again to engage
in another design cycle with the same domain experts. Activity four requires the demonstration of the use of the artifact to solve the problem, followed by an evaluation of its performance. The “(5) Evaluation” stage took place between June and October 2014 in the form of one-hour face-to-face interviews with 30 organizations, and was followed by another minor design cycle. After analyzing the results, we engaged in “(6) Communication” in the form of a formal event, where we presented and discussed the results together with policy makers and application domain experts.
4 Objectives, Design and Development of the Framework
Objectives of a Solution
The ISM assessment framework is intended to be a multidimensional, indicator based,
information security assessment system for critical infrastructure companies taking
ideas from the BSC. The most essential difference is that the BSC requires a very
specific context, while we seek to develop a universal instrument to compare the cyber
security states of critical infrastructure organizations (objective 1). This is the first
among five general objectives that were identified as being essential in order to meet
the specific requirements of this study (see Table 2). These general objectives were identified as a result of the first workshop.
Table 2. Five general framework objectives

(1) General applicability: The framework has to be equally applicable for companies in the critical infrastructure sector.
(2) Representation of the whole organization: The choice of perspectives and multi-objective definition should provide a complete representation of the whole organization.
(3) Expressiveness of sub-dimensions: Sub-dimensions should have sufficient expressiveness to identify strengths and weaknesses in order to allow for corrective measures.
(4) Multidimensional measurability on homogenous scales: Each sub-dimension is represented by multiple indicators that sufficiently describe the sub-dimension with homogenous scales. Indicators need to be defined in a way that exhaustively covers the sub-dimension on the desired abstraction levels.
(5) Aggregation of sub-dimensions: All sub-dimensions need to be aggregable with prior defined criteria in order to enable statements on different abstraction levels as well as to provide a holistic overall assessment.
The arrangement of perspectives allows communicating the overall situation. Therefore, a precise definition and an appropriate number of different, correlated dimensions and targets of security-relevant factors, in terms of multi-criteria decision making, is necessary [35]. The measurement of scorecard dimensions is thus based on multiple modular subdomains. An aggregation of all measurements of an individual subdomain enables an overall statement about a dimension. This results in different abstraction layers. The representation of an individual dimension provides an overview, whilst the inspection of individual and manageable sub-dimensions allows for more detailed insights, which can then be used to take measures [36].
An essential restriction for the choice of indicators is their measurability. The number of indicators describing a sub-dimension may vary; a higher number of indicators for a sub-dimension does not attach more value to it, but reduces the error of measurement. In order to summarize multiple indicators, scales are defined. Scaling schemes allow for quantitative measurement of dimensions that were measured qualitatively. Scales are measurement instruments that are used to numerically identify a relative amount, a position, or the presence or absence of a relevant unit [37]. As indicators serve as basic elements, their proper definition is of high importance.
In order to obtain an easily describable result, it is necessary to group results starting at the indicator level to gradually reach a higher abstraction level (see Fig. 2). For an easy aggregation of the measurement results of individual sub-dimensions, it is necessary that they are balanced. Within the dimensions, the relevant indicators are condensed into index values. For this aggregation, different approaches can be applied depending on the knowledge and measurability of the indicators. A unidimensional evaluation method condenses all sub-dimensions of the framework into a single main dimension. This method, however, leads to a loss or falsification of information. Additionally, sub-dimensions are not independent or mutually exclusive, which may result in distortion. Nevertheless, the instrument is useful to classify the achieved degree of information security of a single organization or a group of organizations.
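The aggregation over abstraction levels described above can be sketched as a simple additive model. The dimension and sub-dimension names follow the framework, while the indicator values and the equal weights are illustrative assumptions:

```python
# Indicators (normalized to a homogeneous 0..1 scale) are averaged into
# sub-dimension indices, which are condensed into dimension scores and,
# for a unidimensional evaluation, into one overall index.
framework = {
    "Security Ambition": {
        "risk management": [0.8, 0.6, 0.7],
        "awareness & training": [0.5, 0.9],
    },
    "Security Process": {
        "patch management": [0.4, 0.6],
        "incident management": [0.7, 0.7, 0.8],
    },
}

def mean(values):
    return sum(values) / len(values)

# Equal weights here; simple additive weighting would multiply each
# index by a predefined weight before summing.
sub_indices = {
    dim: {sub: mean(vals) for sub, vals in subs.items()}
    for dim, subs in framework.items()
}
dimension_scores = {
    dim: mean(list(subs.values())) for dim, subs in sub_indices.items()
}
overall_index = mean(list(dimension_scores.values()))

print(dimension_scores)
print(round(overall_index, 3))  # 0.658
```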
Framework Design
The ISM assessment framework consists of four main dimensions that are illustrated in
Fig. 1. The first dimension was called Security Ambition (SA) and refers to the
potential or future perspective of the BSC. The second dimension is called Security
Process (SP) and is based on the internal process perspective of the BSC. The third
Resilience (RE) perspective is based on the customer perspective of the BSC. Finally,
the Business Value (BV) refers to the financial or value perspective of the BSC. These
dimensions and their reasoning are described next.
Fig. 1. Structure of the Information Security Management (ISM) assessment framework
According to NIST recommendations [31, 32], we included security guidelines as a
starting point which are usually realized through an information security policy (ISP).
This requirement is covered by the perspective “Security Ambition” within our model.
As second pillar, NIST proposes to incorporate efficiency and effectiveness of the
information security processes. The third pillar of NIST 800-55 covers security incidents that are directly influencing the business or mission level. This is covered by the
perspective “Resilience”. It is also important to assess whether the critical infrastructure
organization itself realizes what additional benefit information security offers. This is
realized within the “Business Value” dimension of our framework. This approach not only covers the recommendations of the basic balanced scorecard concept, but additionally takes into consideration that investment in information security requires transparency on the cost and value sides alike. Table 3 shows how these different dimensions relate to other studies and frameworks from the literature.
Security Ambition. This dimension covers all strategic and future-oriented ambitions
to preserve or raise the security level of an organization. This includes primarily
organizational factors (such as existence and quality of information security regulations
and risk management), as well as procedural efforts in the domain of security awareness, training and education. The area of knowledge management covers the ability of
an organization to assess development and outcome of new technologies. This ability is
an essential requirement to be able to respond adequately to previously unknown threats. In
terms of general BSC dimensions, this category represents the future perspective [9]
and takes the fundamental importance of the information security policy according to
NIST 800-55 [31, 32] into account.
Security Processes. This dimension covers the efficiency and effectiveness of relevant
information security processes of an organization. It is composed of the sub dimensions: information security management systems (ISMS) [38], patch management,
change management, identity and access management, asset management, monitoring
and reporting, as well as incident management and related cause research and problem
analysis. In reference to the NIST 800-55 framework [31, 32], this dimension represents the “effectiveness/efficiency metrics of security service delivery” which assesses
whether the security measures were implemented properly, perform as expected and
deliver the expected outcome. NIST defines effectiveness as the robustness of the result
itself and efficiency as the timeliness of the result [32]. According to the BSC logic, the
quality of internal processes is examined, which are mapped by information security
processes in the context of this study [9].
Resilience. The resilience perspective determines an organization's resilience with regard to the required availability of service levels. This is of utmost importance for critical infrastructure organizations. An organization's ability to maintain its service delivery mainly depends on how well incident management is realized and on how often and how rigorously audits and security analyses are carried out. In order to be able to react to changes in the availability of services, an emergency conception is required. It is thereby relevant how well the conception is internalized and how well relevant contractual partners are
incorporated into this conception. Mapped to the BSC logic [9], resilience equals the customer perspective. The resilience perspective is further supported by the third pillar of the cybersecurity framework of NIST [31].

Table 3. Models and frameworks related to ISM framework dimensions

Model/framework | Dimensions (number of indices)
Enterprise security governance security dashboard [25] | Risk management (6); Policy compliance (10); Asset management (5); Knowledge management (8); Incident management (7); Continuity management (10); Security infrastructure (8)
Security performance (BSC) [24] | Financial (5); Customer (5); Internal process (11); Learning & growth (14)
Enterprise identity management [26] | Financial/monetary (3); Business process (3); Supporting process and ICT infrastructure (3); Information security, risk and compliance (3)
Cybersecurity health check [29] | Protection requisitions (2); Defense-in-depth implementation (3); ISMS establishment (4); Security awareness & education (3)
Quality of service security [30] | Compliance (0); Integrity (0); Reliability (0); Availability (2); Confidentiality (3)
NIST publication 800-55 [31, 32] | Implementation of security policy; Effectiveness/efficiency metrics of security services delivery; Impact metrics of security events on business/mission level
ISO/IEC 27001/2 (ISO 17799) [38, 42] | Security policy; Organizing information security; Asset management; Human resources security; Physical and environmental security; Communications and operations; Access control; Information security acquisition, development & maintenance; Information security incident management; Business continuity management
Business Value. This dimension reflects how much value an organization attaches to
security and captures how well information security is measured. The main focus of this
dimension is not the outcome of the measurement, but the efficiency and
effectiveness of the measuring system itself. Performance in this dimension shows
whether an organization knows, measures, and manages its own information security level. Additionally, it is important to assess whether the costs and the
value of information security are transparent. The NIST 800-55 recommendation [32]
is therefore complemented with a dimension that, in terms of the BSC logic, reflects the
financial or value perspective [9].
Frameworks in ICT evaluation and management are usually based on additive value
models. These models use multiple (often conflicting) attributes to maximize a single
quantity called utility or value. To aggregate single utilities and generate a super scale
or index, multiple single-attribute value functions are combined. This is most commonly
achieved by a simple additive weighting procedure [39], for example, in the Utility
Ranking or Value Method [40]. The aggregation is undertaken by a weighted sum of
single-attribute value functions. In the weighted sum method, the overall suitability of
each alternative is thereby calculated as the sum of the alternative's scores with
respect to every attribute, each multiplied by the corresponding importance weight.
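The weighted sum calculation can be sketched as follows; the weights, attribute scores, and alternative names are illustrative assumptions and are not taken from the framework itself:

```python
# Simple additive weighting (SAW): a minimal sketch of the weighted sum
# method described above. Attribute weights and scores are illustrative
# assumptions only.

def weighted_sum(scores, weights):
    """Aggregate single-attribute value scores into one overall value."""
    assert len(scores) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9  # importance weights are normalized
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical alternatives scored on three attributes (0 = worst, 1 = best)
weights = [0.5, 0.3, 0.2]
alternatives = {
    "Alternative A": [0.8, 0.6, 0.9],
    "Alternative B": [0.6, 0.9, 0.7],
}

for name, scores in alternatives.items():
    print(name, round(weighted_sum(scores, weights), 3))
```

The alternative with the highest weighted sum is deemed the most suitable; note that a linear weighted sum implicitly assumes that a weak score on one attribute can be compensated by strong scores on others.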
A value aggregation is possible at each level, as all values should represent independent, unified, and normed target values of the given company. This procedure is
called value synthesis in the context of utility analysis [40]. Different approaches can
be applied for aggregating the data, for example, the Profile Distance Method [41].
A common method is to use a weighted average with normalized indicator values [36].
Normalization in this context refers to adjusting measurements on different scales to a
uniform scale. It should be noted that aggregation leads to information loss and bias.
If no information is available for weighting, it is advisable to use a simple, unweighted
average, which results in average values for each group. Our model
uses a hierarchical indicator model and assumes each hierarchical level to be equally
weighted. That means that each abstraction level aggregates values with the same
weight, independent of the number of indicators. For easier handling
and better comprehensibility it is common to use a linear aggregation method. However, this
assumes that all values are compensable with each other, which does not hold in
reality [36]. Individual evaluation criteria represent different features which, in sum,
allow for an extensive and holistic evaluation of the information security level of a
participating company.
The hierarchical levels of the index are visualized in Fig. 2. The numbers in
brackets refer to the number of questions within the questionnaire. The black boxes
display the sub-dimension indexes, which together form the dimension index. In total,
the scorecard is made up of 108 indicators that are grouped into 19 sub-dimensions,
which are subsumed under four main dimensions that finally form the overall index.
Fig. 2. Hierarchical levels of the ISM framework
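The hierarchical, equally weighted aggregation can be sketched as follows; the sub-dimension names and indicator values are hypothetical (the real scorecard comprises 108 indicators in 19 sub-dimensions):

```python
# Minimal sketch of the hierarchical aggregation described above: indicator
# values are averaged into sub-dimension indexes, sub-dimension indexes into
# dimension indexes, and dimension indexes into the overall index. Each level
# aggregates with equal weights, independent of how many indicators feed into
# it. Structure and values below are illustrative assumptions.

def mean(values):
    return sum(values) / len(values)

def aggregate(node):
    """Recursively average a nested dict/list of indicator values."""
    if isinstance(node, dict):
        return mean([aggregate(child) for child in node.values()])
    return mean(node)  # a leaf: a list of raw indicator values

# Hypothetical two-dimension excerpt of the indicator hierarchy
index_tree = {
    "Security Process": {
        "Sub-dimension 1": [2, 3, 1],   # raw indicator values
        "Sub-dimension 2": [4, 2],
    },
    "Resilience": {
        "Sub-dimension 3": [1, 1, 2, 2],
    },
}

overall_index = aggregate(index_tree)
```

Because every level is averaged with equal weights, a sub-dimension with two indicators influences its dimension index just as strongly as one with ten.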
5 Framework Demonstration and Evaluation
The developed information security framework included contextual questions, which
were used to enable an additional qualitative assessment alongside the quantitative
assessments needed to create aggregate values (indices).
Upon finalization of the framework and the underlying questionnaire, two pre-tests
were carried out. The main aims of these pre-tests were to test the framework's measurement instrument via structured face-to-face interviews in real-life environments and to
demonstrate the validity of the instrument. It was tested whether the instrument is
comprehensible and how much time it takes to answer all questions diligently. For this
test run it was of particular interest whether the target person of the
interview, the Chief Information Security Officer (CISO), could answer all questions by
herself without consulting an additional knowledge source (colleagues or a database).
The results of these pre-tests were used to further refine the questionnaire. It was
necessary to optimize the length of the questionnaire considering the time limitations for
completing it. All questions that could not be answered or were likely to require
preparation time were removed. The pre-tests also demonstrated that it
was necessary to add some more contextual questions in order to avoid misinterpretation of some index values. After all adjustments were made, a second round of
pre-tests was carried out. The findings of this round did not require further refinements
of the structured questionnaire. Therefore, the research instrument was deemed
ready to be employed.
Framework Evaluation
The data collection process with face-to-face interviews alongside a structured questionnaire was conducted from June 2014 to October 2014 in
Austria. In total, 30 companies participated: eight companies from the information and
communication technology industry, six from the energy supply sector, six
transportation companies, four from the banking sector, three from the health care
sector, and three from other industries. Although the number of participants was not
very high, a large part of Austria's critical infrastructure companies was covered.
The questionnaire comprises 108 indicators in total: 34 indicators for the dimension "Security Ambition", 25 indicators for the dimension "Resilience", 40 indicators
for the dimension "Security Process", and 8 indicators for the dimension "Business
Value". Additionally, the questionnaire uses descriptive complementary questions in
open and semi-structured form. All indicators are assessed on a 5-point Likert scale
with interval-scaled answers ranging from 1 (excellent) to 5 (insufficient). This common scale allowed for the required uniform understanding of questions by all participants. Participants were openly asked to give a critical self-assessment and to provide a satisfaction
level for these questions. The questionnaire also comprises some contextual questions
that allow for a qualitative assessment. For example, the value of IT security is measured by considering the allocation of financial resources that are directly associated
with the service delivery of security management.
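If the interval-scaled Likert answers are to be normalized to a uniform scale before aggregation, a linear mapping is one possibility; the formula below is our illustrative assumption and is not prescribed by the framework:

```python
# Normalizing the 5-point Likert answers (1 = excellent, 5 = insufficient)
# to a uniform 0..1 scale, where higher means better. The linear mapping is
# an illustrative assumption only.

def normalize_likert(answer, best=1, worst=5):
    if not best <= answer <= worst:
        raise ValueError("answer outside Likert range")
    return (worst - answer) / (worst - best)

# 1 (excellent) maps to 1.0, 3 (average) to 0.5, 5 (insufficient) to 0.0
scores = [normalize_likert(a) for a in (1, 3, 5)]
```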
6 Discussion and Conclusion
This paper proposed the development of a general framework for assessing the quality
of information security management, driven by the need to understand the comparative levels of cyber-security of critical infrastructure organizations. A preliminary
literature review identified the BSC as a useful base model for developing such a
framework due to its wide recognition in practice and literature, its overall comprehensiveness, and its strategic management orientation. A design science research (DSR) approach based on the six stages by Peffers et al. [33] was used to develop an information
security framework, which took into account the three DSR cycles presented by Hevner
[34]. These cycles assured relevance, rigor, and effective design principles. In particular,
we involved subject experts and CISOs as the main stakeholders from the information
security domain. The DSR research process was concluded with a presentation to all
stakeholders including many of the interviewed CISOs, policy makers, and security and
critical infrastructure industry experts.
The resulting information security framework is deemed capable of realizing a
general information security status evaluation, which can be used to compare critical
infrastructure organizations. For this purpose, we used a range of 108 indicators that are
grouped into 19 sub-dimensions. However, in order to give these metrics additional
meaning and support understanding, it was necessary to complement the quantitative
questions with open qualitative questions that capture special contextual situations.
More precisely, the quantitative evaluation serves benchmarking, whereas the qualitative
questions can be used to explain and interpret scores. The scorecard in terms of the
quantitative estimators alone does not portray the complete security status. An example
of an open question is "Who reports to the management? Describe the reporting
structure." In contrast, a typical index question consists of an evaluation of the quality
of, for example, an emergency concept, and the significance given to it by upper
(business level) management.
An important issue that needs to be clarified prior to the application of the assessment framework is its acceptance as a comparative learning and assessment tool that reveals
relative weaknesses and strengths. As the questionnaire is based on a self-assessment, it
would easily be possible to whitewash potential problems. This has to be clarified prior
to its implementation at the organizational and individual levels. As the CISO is the main
target person for the questionnaire, she is prone to experience a social desirability bias,
that is, the tendency to give socially desirable responses [43].
Future work could include an evaluation of possible social desirability bias in order
to control for it in the data analysis. A further enhancement would be the incorporation of
companies' resource dependencies in regard to cyber-security, which would result in a
more holistic view. It should be possible to cascade the framework for multiple
companies. Other future research could extend our work by comparing the proposed
framework with others in terms of applicability and accuracy. The proposed framework
needs constant refinement in terms of both structure and suggested metrics so that it can be
applied across the different sectors within critical infrastructure to allow
for comparisons. Additionally, different aggregation approaches could be used, in
particular to account for different weighting profiles.
Acknowledgment. We would like to thank Michael Stephanitsch for conducting the interviews
and especially Wolfgang Gattringer, Heiko Borchert and Wolfgang Rosenkranz for hosting the
workshops and engaging in the discussions.
References
1. Gottwald, S.: Study on Critical Dependencies of Energy, Finance and Transport
Infrastructures on ICT Infrastructure. European Commission (2009)
2. Hare, F.: The cyber threat to national security why can’t we agree? In: Conference on Cyber
Conflict Proceedings, p. 15 (2010)
3. Council Directive 2008/114/EC. EU (2008)
4. Gercke, M.: Understanding Cybercrime: A Guide for Developing Countries (2011)
5. Vaughn Jr, R.B., Henning, R., Siraj, A.: Information assurance measures and metrics - state
of practice and proposed taxonomy. In: Proceedings of the 36th Annual Hawaii International
Conference on System Sciences, p. 10 (2003)
6. Fink, G., O’Donoghue, K.F., Chappell, B.L., Turner, T.G.: A metrics-based approach to
intrusion detection system evaluation for distributed real-time systems. In: Proceedings of
the 16th International Parallel and Distributed Processing Symposium, p. 17. IEEE
Computer Society (2002)
7. Bernroider, E.W.N., Koch, S., Stix, V.: A comprehensive framework approach using
content, context, process views to combine methods from operations research for IT
assessments. Inf. Syst. Manag. 30, 75–88 (2013)
8. Herath, T., Herath, H., Bremser, W.G.: Balanced scorecard implementation of security
strategies: a framework for IT security performance management. Inf. Syst. Manag. 27, 72–81 (2010)
9. Kaplan, R.S., Norton, D.P.: The balanced scorecard - measures that drive performance.
Harv. Bus. Rev. 70, 8 (1992)
10. Kaplan, R.S., Norton, D.P.: Putting the balanced scorecard to work. Harv. Bus. Rev. 71, 14 (1993)
11. Kaplan, R.S., Norton, D.P.: The Balanced Scorecard: Translating Strategy into Action.
Harvard Business School Press, Brighton (1996)
12. Rigby, D., Bilodeau, B.: Management Tools & Trends 2013. Bain & Company, Boston (2013)
13. Lawrence, S., Sharma, U.: Commodification of education and academic labour – using
the balanced scorecard in a university setting. Crit. Perspect. Account. 13, 661–677 (2002)
14. Protti, D.: A proposal to use a balanced scorecard to evaluate information for health: an
information strategy for the modern NHS (1998–2005). Comput. Biol. Med. 32, 221–236 (2002)
15. Littler, K., Aisthorpe, P., Hudson, R., Keasey, K.: A new approach to linking strategy
formulation and strategy implementation: an example from the UK banking sector. Int. J. Inf.
Manag. 20, 411–428 (2000)
16. Irwin, D.: Strategy mapping in the public sector. Long Range Plan. 35, 637–647 (2002)
17. Southern, G.: From teaching to practice, via consultancy, and then to research? Eur. Manag.
J. 20, 401–408 (2002)
18. Ahn, H.: Applying the balanced scorecard concept: an experience report. Long Range Plan.
34, 441–461 (2001)
19. Norreklit, H.: The balance on the balanced scorecard – a critical analysis of some of its
assumptions. Manag. Account. Res. 11, 65–88 (2000)
20. Martinsons, M., Davison, R., Tse, D.: The balanced scorecard: a foundation for the strategic
management of information systems. Decis. Support Syst. 25, 71–88 (1999)
21. Rosemann, M., Wiese, J.: Measuring the performance of ERP software – a balanced scorecard
approach. In: Australasian Conference on Information Systems, p. 10 (1999)
22. Van Grembergen, W.: The balanced scorecard and IT governance. Inf. Syst. Control (2000)
23. Bernroider, E.W.N., Hampel, A.: An application of the balanced scorecard as a strategic
IT-controlling instrument for E-business development. In: International Conference on
Electronic Business, Singapore (2003)
24. Huang, S.M., Lee, C.L., Kao, A.C.: Balancing performance measures for information
security management. Ind. Manag. Data Syst. 106, 242–255 (2006)
25. de Oliveira Alves, G.A., da Costa Carmo, L.F.R., Almeida, A.C.R.D.: Enterprise security
governance: a practical guide to implement and control information security governance
(ISG). In: The First IEEE/IFIP International Workshop on Business-Driven IT Management,
BDIM 2006, pp. 71–80 (2006)
26. Royer, D., Meints, M.: Enterprise identity management – towards a decision support
framework based on the balanced scorecard approach. Bus. Inf. Syst. Eng. 1, 245–253 (2009)
27. TSO: Introduction to the ITIL Service Lifecycle. The Stationery Office (TSO), Office of
Government Commerce (OGC), Belfast, Ireland (2010)
28. ISACA: COBIT – 4th Edition. Information Systems Audit and Control Foundation, IT
Governance Institute, Rolling Meadows, USA (2007)
29. Pan, M.S., Wu, C.-W., Chen, P.-T., Lo, T.Y., Liu, W.P.: Cybersecurity health check. In:
Andreasson, K. (ed.) Cybersecurity: Public Sector Threats and Responses, p. 392. CRC
Press, Boca Raton (2012)
30. Charuenporn, P., Intakosum, S.: QoS-security metrics based on ITIL and COBIT standard for
measurement web services. J. Univ. Comput. Sci. 18, 24 (2012)
31. NIST: Framework for Improving Critical Infrastructure Cybersecurity (2014)
32. Chew, E., Swanson, M., Stine, K., Bartol, N., Brown, A., Robinson, W.: NIST Special
Publication 800-55 Information Security. NIST, Gaithersburg (2008)
33. Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S.: A design science research
methodology for information systems research. J. Manag. Inf. Syst. 24, 45–77 (2007)
34. Hevner, A.R.: A three cycle view of design science research. Scand. J. Inf. Syst. 19, 4 (2007)
35. Bernroider, E.W.N., Mitlöhner, J.: Characteristics of the multiple attribute decision making
methodology in enterprise resource planning software decisions. Commun. IIMA 5, 49–58 (2005)
36. Merz, M.: Entwicklung einer indikatorenbasierten Methodik zur Vulnerabilitätsanalyse für
die Bewertung von Risiken in der industriellen Produktion. KIT Scientific Publishing (2011)
37. Atteslander, P.: Methoden der empirischen Sozialforschung. Erich Schmidt Verlag (2008)
38. ISO/IEC: The ISMS family of standards (2700X). Joint Technical Committee ISO/IEC
JTC 1, Information Technology, Subcommittee SC 27, IT Security Techniques (2014)
39. Yoon, K.P., Hwang, C.-L.: Multiple Attribute Decision Making: An Introduction. Sage
University Paper Series on Quantitative Applications in the Social Sciences. Sage, Thousand
Oaks (1995)
40. Zangemeister, C.: Nutzwertanalyse in der Systemtechnik. Wittemann'sche Verlagsbuchhandlung, München (1976)
41. Bernroider, E.W.N., Stix, V.: Profile distance method – a multi-attribute decision making
approach for information system investments. Decis. Support Syst. 42, 988–998 (2006)
42. Sahibudin, S., Sharifi, M., Ayat, M.: Combining ITIL, COBIT and ISO/IEC 27002 in order
to design a comprehensive IT framework in organizations. In: Second Asia International
Conference on Modeling and Simulation, AICMS 2008, pp. 749–753 (2008)
43. Grimm, P.: Social desirability bias. In: Wiley International Encyclopedia of Marketing. Wiley,
Hoboken (2010)
Advanced Manufacturing and
Management Aspects
From Web Analytics to Product Analytics:
The Internet of Things as a New Data Source
for Enterprise Information Systems
Wilhelm Klat, Christian Stummer(B) , and Reinhold Decker
Faculty of Business Administration and Economics, Bielefeld University,
Universitaetsstr. 25, 33615 Bielefeld, Germany
Abstract. The internet of things (IoT) paves the way for a new generation of consumer products that collect and exchange data, constituting
a new data source for enterprise information systems (EIS). These IoT-ready products use built-in sensors and wireless communication technologies to capture and share data about product usage and the environment
in which the products are used. The dissemination of the internet into
the physical world of everyday products thus establishes new opportunities to apply methods well-established in web analytics to IoT-products,
allowing enterprises to tap into a new and rich source of consumer data.
In this paper we examine technical challenges of enabling everyday products to generate consumer data for EIS and discuss the application of
web analytics methods to IoT-ready consumer products.
Keywords: Internet of things · Information systems · Product analytics
1 Introduction
A growing number of recently introduced consumer products are able to
sense their environment and share data with users, other products, and companies via the internet of things (IoT). Examples are LG's refrigerator "Smart
ThinQ", VW's smart minivan "BUDD-e", Sleep Number's mattress "It Bed",
and Verbund's energy monitoring and controlling system "Eco-Home". These
products have in common that they extend the basic functionalities of regular products with the ability to collect and share data [2,22,27,28]. We call
this new category of data-collecting and -sharing products IoT-ready products
(or for short, IoT-products). The IoT was not established at a specific point in
time; rather, it has emerged in a continuous, and still ongoing, process. It can
be interpreted as “a global network infrastructure, linking physical and virtual
objects through the exploitation of data capture and communication capabilities” [5]. Accordingly, for the purpose of this paper we define an IoT-product as
a consumer product that autonomously collects and exchanges consumer data.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 145–155, 2016.
DOI: 10.1007/978-3-319-49944-4 11
W. Klat et al.
The IoT is a potential game changer for nearly every business, and it has
received attention from both academia and practitioners around the world (for a
recent overview of research trends and challenges, see [52,53]). From the perspective of enterprise information systems (EIS), the IoT will be the key to
unlocking the full potential of EIS (see, for example, [40]). Conceptual designs
for corresponding EIS have already been outlined in prior work (e.g., [26,50]).
Our approach differs from these by focusing on consumer data that
is collected by the physical IoT-products themselves in the consumer's environment. While websites, social networks, search engines, and other traditional
sources for data generation in sales market research are built on the traditional
internet and the World Wide Web, IoT-products provide enterprises with access
to a new kind of data. With traditional products, enterprises lose direct access to
their products at the point of sale [27]. With IoT-products, however, enterprises
have the opportunity to gain insights into the actual interaction of customers
with their products as well as to collect additional data from the customer’s
environment, which provides these enterprises with access to large amounts of
longitudinal data in a way that has not been previously available [37]. The data
collected by IoT-products primarily supports decision making in marketing and
sales, but other enterprise divisions and EIS-subsystems may also benefit from
this data (see Fig. 1). The car manufacturer Tesla, for instance, has already captured data about 780 million miles of driving from its IoT-ready cars, using the
data to improve the technical functionality of the autopilot and to optimize the
user interface [8].
Fig. 1. IoT-products as a new data source for Enterprise Information Systems (see,
for example, [32]). DSS = Decision Support Systems, MIS = Management Information
System, ESS = Executive Support Systems, TPS = Transaction Processing Systems,
KWS = Knowledge Work Systems
The new opportunities for data collection and analysis from connected everyday products emerge from a concept we refer to as “product analytics” (see also
[20]). Product analytics is the IoT-equivalent of web analytics and aims at the
The Internet of Things as a New Data Source for EIS
autonomous collection and analysis of usage data from the customers’ environment. IoT-products thus constitute a promising platform for the application of
analytics technologies that have already proven their effectiveness on the web
and on mobile phones. The research contribution of this paper therefore lies in
the exploration and discussion of the new data collection opportunities embodied
by IoT-products for enterprise information systems.
Accordingly, the remainder of the paper is organized as follows: Sect. 2 deals
with technical aspects of IoT-products and outlines major challenges of transforming traditional products into IoT-ready ones with respect to technologies for
data collection, data exchange, and energy supply. Section 3 then focuses on the
vital aspect of data collection and discusses the application of established methods from web analytics to IoT-products. The paper concludes with a summary
and suggestions for promising research directions in Sect. 4.
2 Technical Foundation of IoT-Products
The application of product analytics to physical consumer products requires
them to be ready for the IoT. The core components for this purpose are (i)
sensors and processors for data collection and processing, (ii) transceivers for
wireless data exchange, and (iii) energy supply (see Fig. 2).
Fig. 2. Components of IoT-products necessary for product analytics
Integrating these IoT-components seems rather trivial for large-sized products that provide sufficient space, can draw on a genuine energy supply, and
already have built-in sensors or communication technologies (e.g., in-home health
care devices, as described in [33]). Readying everyday products such as shoes,
watches, or flower pots for the IoT, though, still constitutes a technical challenge. However, such small everyday IoT-products in particular exhibit a huge
potential for generating rich consumer data [1]. Tracking a consumer’s number
of daily footsteps may serve as an illustrative example. A smartphone infers this
data by tracking movement with GPS, accelerometers, and other built-in sensors. IoT-ready shoes, in contrast, are much closer to the event of interest and
therefore can capture footsteps directly and more precisely, including the actual
number of steps, the pressure applied, the weight distribution between the feet,
and much more. The rich data from IoT-ready shoes thus offer enterprises
from industries ranging from footwear manufacturing to healthcare a highly valuable
opportunity to better understand individual usage patterns and to offer data-based services. The IoT-pioneer Orpyx, which offers IoT-ready shoe inserts that
transform traditional shoes into IoT-shoes, for instance, already uses the data
generated by these inserts to remotely diagnose certain diseases at an early stage.
The continuous progress in computer miniaturization, processing speed, and
storage will pave the way to integrating IoT-components into small products
even if they so far lack any electronic support [51]. In principle, IoT-products
can autonomously acquire data from three sources, namely, internally from built-in sensors, externally from other IoT-products in close proximity, and from the
EIS itself. Moreover, enterprises may also indirectly access data generated by
third-party IoT-products that has been mutually exchanged with their own IoT-products.
Small, everyday products can typically host only one, if any, wireless communication technology. Each technology has its advantages and drawbacks and
is suitable for specific fields of application (for an overview, see Fig. 3). In the
recent past the development of wireless communication technologies has focused
on personal and body area networks as well as on wireless sensor networks.
Bluetooth Low Energy, Wi-Fi Direct, and Near Field Communication are particularly promising technologies for small and mobile IoT-products. In general,
the assessment of communication technologies for a specific IoT-product should
take into account communication distance, required data throughput, latency,
reliability, practicability for customers, and technology dissemination [7].
Especially for miniaturized and mobile IoT-products, energy consumption
and supply constitutes a significant technical challenge. Although batteries may
be an inexpensive solution, the increasing number of IoT-products will render periodic recharging and replacement of numerous batteries difficult [41,45].
A promising approach to address this issue is energy harvesting. Energy-harvesting-enabled IoT-products are capable of autonomously harnessing energy
from the environment and converting it to electrical energy. Most energy-harvesting methods for (miniaturized) IoT-products rely on solar [6,38], wind
[14,18], temperature differences (e.g., body heat) [15,25], movement (e.g.,
body movement) [13,19,30], vibrations [3], radio frequency [24,29,36], or microbial activity [54]. Piezo-electric materials embedded in IoT-ready shoes or light
switches, for instance, generate sufficient electrical energy from controllable
mechanical deformation to send radio signals over short distances to smartphones or IoT-ready light bulbs [34,42]. For a more in-depth overview of energy-harvesting techniques in wireless sensor networks, we refer to [9,45,47].
Fig. 3. Wireless communication technologies categorized by their typical field of application. WBAN = Wireless Body Area Network, WPAN = Wireless Personal Area Network, WLAN = Wireless Local Area Network, WMAN = Wireless Metropolitan Area
Network, WWAN = Wireless Wide Area Network. Year of market introduction in
3 Data Collection with Product Analytics
3.1 Evolution of Product Analytics
"Connected" products are a new phenomenon on the market side of enterprises.
Until now, enterprises have generally lost direct access to their products at the
point of sale [27]. In order to gain deeper insights into actual product usage, product condition, and other information of interest, enterprises had
to conduct resource-intensive market studies and maintain extensive consumer
dialogs, which leave little room for trials and feedback on a large scale [22]. For
the purpose of collecting data for their EIS, enterprises more often than not
resort to the internet, which has thus become a popular source for data mining, including online consumer ratings, reviews, discussions, subscriptions, and
other consumer-related actions. Modern web technologies also reduce the common information asymmetry between consumers and enterprises [4,12]. Still, the
internet has undergone a development from the initial stationary internet, through the
current mobile internet stage, to the already emerging internet of things
(for an overview, see Fig. 4).
The stationary internet and web analytics have made it possible to capture consumer
behavior in a novel way through company websites. Google, comScore, Adobe, and
Mixpanel are currently the worldwide leading providers of well-established web
Fig. 4. The evolution of the internet of things and product analytics
analytics technologies. Although the web has significantly shortened the distance
between enterprises and consumers, both physically and emotionally, to just a
click [16], web analytics still requires consumers to actively generate feedback data
on the web. The dissemination of smartphones and the mobile internet has further increased the time consumers are close to enterprises because, in contrast
to a personal computer, smartphones can be used almost anytime and anywhere
(for an application example, see [48]). Thus, mobile analytics have provided new
opportunities. With regard to the mobile internet, Google Analytics, Flurry Analytics, Crashlytics, and HockeyApp are popular examples of tools that capture
the consumer's interaction with mobile apps and mobile web browsers. Following these predecessors, the current evolutionary stage of the internet, the IoT, is about
to establish a permanent connection between consumers and enterprises. In the
future, consumers will be surrounded by IoT-products that permanently capture
their interactions with IoT-products, everyday objects, and other persons. With the expansion of the internet from computers and smartphones to
everyday things, the application field of tools used in web and mobile analytics
likewise expands [4,12]. Thus, IoT-products constitute a promising platform to
collect data about actual product usage.
We refer to the IoT-counterpart of web and mobile analytics as "product
analytics" because the medium for customer interaction is not a web browser
but the physical product itself and, if available, a corresponding mobile app
that functions as a remote control for the IoT-product. Although a wide range
of IoT-ready products already exists on the market, the application of product
analytics is still in its infancy; standardized tools and best practices for
product analytics are likely to emerge in the upcoming years. In the remainder
of this section we explore established methods from web analytics and discuss
their application to IoT-products.
Log Files
Web servers store a log of page requests including data such as the originating
internet protocol (IP) address, date and time of the request, referrer, and some
information about the device from which the pages are requested [16]. These log
files serve as a data foundation for the most common metrics such as the frequency and duration of visits, visitor paths, demographic and geographic visitor
The Internet of Things as a New Data Source for EIS
information, referring websites and keywords, operating system statistics, and
so forth. Data collection with log files can be executed on both the server and
the client level [43]. The collected data helps enterprises to understand consumer
behavior such as how consumers react to product information, how they make
purchase decisions, which consumer segment is most likely to make product purchases, and reasons for consumers to bounce from (fail to follow through with)
the purchase process. The insights extracted from log files are used to support
decision making in various enterprise divisions and levels including communication or pricing decisions in marketing, product feature decisions in product
development, training of sales representatives and technicians, etc.
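As an illustration of how such log files are mined, the following sketch parses server log lines in the common Combined Log Format and counts requests per originating IP address; the sample entries and field names are illustrative, not taken from any particular system.

```python
import re
from collections import Counter

# Combined Log Format: IP, timestamp, request, status, size, referrer, user agent
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log(lines):
    """Yield one dict per well-formed log line, skipping malformed entries."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield m.groupdict()

# Hypothetical sample entries
sample = [
    '93.184.216.34 - - [13/Dec/2016:10:00:01 +0000] "GET /shop HTTP/1.1" '
    '200 512 "https://example.com" "Mozilla/5.0"',
    '93.184.216.34 - - [13/Dec/2016:10:00:05 +0000] "GET /cart HTTP/1.1" '
    '200 256 "https://example.com/shop" "Mozilla/5.0"',
]

# Request counts per originating IP - the simplest of the metrics listed above
visits = Counter(entry["ip"] for entry in parse_log(sample))
print(visits)
```

The same counting approach extends to referrers, user agents, or request paths by grouping on the corresponding field of each parsed entry.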
IoT-products offer an attractive platform for the autonomous collection of
product usage data in log files that are automatically pushed to EIS at intervals
or on request. The data from built-in sensors can be analyzed in order to determine the frequency and intensity of product usage, usage patterns and triggers
of product abandonment, product malfunctions, incorrect product usage, and
much more. Through an EIS, these insights can be utilized to predict product maintenance and the topic of inbound calls in call centers [37], analyze the
sequence of events on the customer journey [31], maintain direct relationships
with consumers throughout the whole product life cycle [21], capture longitudinal data for customer relationship management [17,39], or apply novel business
models that consider usage behavior [11]. Amazon, for instance, uses log files to
offer ebook authors per-page payouts. Another promising field of application for
information being derived from log files is product optimization. Log files are
an established source for capturing the behavior of website visitors in so-called
A/B tests. In these experiments, the performance of two operational versions of
a website that differ in only one single variable such as a logo or a button is compared, which ultimately allows testing of hypotheses with respect to conversion
rates and other visitor actions of interest. When applied to IoT-products, experimental tests of variations in digital user interfaces and product features can
deliver information for EIS about consumer preferences, latent needs, cognitive
capabilities of using the product, and so forth. Log file data from IoT-products
can also provide insights about the consumer’s environment by applying association rules to identify complementary products that consumers frequently use
alongside their own IoT-products. Such association rules are currently used in
the web to relate pages or objects that are frequently referenced together in a
single session [43]. A common field of application is product recommendations
in web shops, referring to products that “other customers also bought” using
association rules to identify products that are complementary to items in the
shopping cart of the consumer. In the IoT context, insights on products in the
consumer’s environment can also serve as a foundation to determine cross- and
upselling opportunities and, moreover, they can support decision making about
technical interfaces to ensure compatibility with complementary products.
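A minimal sketch of such association-rule mining over usage sessions might look as follows; the product names, session data, and thresholds are hypothetical, and a production system would typically use a full Apriori or FP-growth implementation rather than this pairwise counting.

```python
from itertools import combinations
from collections import Counter

# Each session lists products observed together in a consumer's environment
sessions = [
    {"smart_bulb", "voice_assistant", "smart_plug"},
    {"smart_bulb", "voice_assistant"},
    {"smart_bulb", "smart_plug"},
    {"voice_assistant", "fitness_tracker"},
]

pair_counts = Counter()
item_counts = Counter()
for s in sessions:
    item_counts.update(s)
    pair_counts.update(frozenset(p) for p in combinations(sorted(s), 2))

def rules(min_support=0.25, min_confidence=0.6):
    """Yield (antecedent, consequent, support, confidence) association rules."""
    n = len(sessions)
    for pair, cnt in pair_counts.items():
        support = cnt / n  # share of sessions containing both items
        if support < min_support:
            continue
        for a in pair:
            (b,) = pair - {a}
            confidence = cnt / item_counts[a]  # P(b | a)
            if confidence >= min_confidence:
                yield a, b, support, confidence

for a, b, sup, conf in rules():
    print(f"{a} -> {b}  support={sup:.2f} confidence={conf:.2f}")
```

Rules such as "smart_plug -> smart_bulb" with high confidence are exactly the kind of complementary-product signal the text describes for cross- and upselling decisions.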
W. Klat et al.
Tagging
Tagging is a more active method of capturing IoT-product usage. In the traditional internet, tagging describes the placement of invisible images or, more
recently, pixels, in specific sections of a website or email that trigger actions when
they are loaded by web browsers or email clients [16]. Unlike the rather continuous
and passive data collection in log files, tagging data is pushed to the EIS in real-time if a predefined user action is tracked. When applied to IoT-products, tagging
is a promising method for enterprises to collect additional real-time data that can
automatically trigger actions in different enterprise divisions through the EIS. As
an example, IoT-product tags may indicate continued usage difficulties on the consumer’s side and notify customer service to proactively contact the consumer for
assistance. Real-time IoT-tagging can also be used to remotely personalize IoT-products with respect to the consumer's usage patterns, similar to website personalization [35]. Some vehicle insurance enterprises such as Metromile already offer
“pay how you drive (PHYD)” policies wherein the driving behavior of consumers
is taken into account in calculating premiums. A small box installed in the car of
the consumer tracks the driving behavior, including driving speed, braking behavior, and turns, and sends this data to the insurance providers. Thus, tagging can
instantly reveal infringements such as driving through a red traffic light, allowing
the enterprise to align premiums with actual risk.
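The real-time trigger behavior of tagging described above can be sketched as a small event dispatcher; the event names, payloads, and the three-attempts rule below are purely illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ProductTag:
    """Fires registered callbacks as soon as a predefined event is tracked,
    mimicking the real-time push behavior of web tagging."""
    handlers: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def on(self, event: str, handler: Callable[[dict], None]) -> None:
        self.handlers.setdefault(event, []).append(handler)

    def track(self, event: str, payload: dict) -> None:
        for handler in self.handlers.get(event, []):
            handler(payload)

alerts = []
tag = ProductTag()
# Hypothetical rule: a third failed setup attempt triggers proactive support
tag.on("setup_failed", lambda p: alerts.append(p) if p["attempt"] >= 3 else None)

for attempt in range(1, 4):
    tag.track("setup_failed", {"device": "thermostat-42", "attempt": attempt})

print(len(alerts))  # 1: only the third attempt crosses the threshold
```

In a real deployment the handler would notify customer service through the EIS instead of appending to a local list.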
Conclusion
The internet of things gives birth to a novel category of consumer products that
can both collect data and share data. These so-called IoT-products constitute
a new source of real-time consumer data for EIS. They use built-in sensors and
wireless communication technologies to capture actual product usage and push
data to the EIS. The current expansion of the internet from the web to physical
products provides an opportunity to apply well-established methods from web
analytics to IoT-products. This paper contributes to the field of research in the
intersection between the IoT and EIS by highlighting new data collection opportunities for EIS. By integrating microsensors, processors, wireless communication
technologies, and energy supplies into traditional products, enterprises may gain
real-time access to data beyond the point of sale. In particular, everyday consumer products may serve as a platform for product analytics, as consumers use
them frequently and such products are typically close to the event of interest.
Tracking and analyzing actual product usage can be implemented passively with
log files or actively with tagging. Both approaches provide enterprises with the
means to extract insights from rich product usage data with methods such as
A/B testing and association rules.
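The A/B tests recapped above ultimately come down to comparing two conversion rates; a minimal sketch of such a comparison, using a standard two-proportion z-test on an illustrative data set, could look as follows.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: variant A, 120 of 1000 users adopted a feature;
# variant B (one changed interface element), 165 of 1000
z, p = two_proportion_z(120, 1000, 165, 1000)
print(f"z={z:.2f}, p={p:.4f}")  # a small p suggests the variants truly differ
```

With IoT-products the "conversion" might be any predefined usage action rather than a purchase, but the statistics are the same.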
While our work has focused on marketing and sales, future research should
also analyze the opportunities of product analytics for other enterprise divisions.
A second promising direction for further research in IoT and EIS lies in the development of frameworks that allow easy cross-platform application of product analytics in IoT-products (see, for example, [23,46]). Such platforms will ultimately
be necessary for product analytics to become as successful and popular as its
counterpart in the web. Finally, IoT-products pose security and privacy challenges (see, for example, [22,44,49]). In the context of product analytics, further
works on data survivability, intrusion detection, and data authentication seem
to be particularly worthwhile [10].
References
1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Comput. Netw. 38, 393–422 (2002)
2. Bauer, S.R., Mead, P.: After you open the box: making smart products more usable,
useful, and desirable through interactive technology. Des. Manag. J. 6, 21–26 (1995)
3. Beeby, S.P., Tudor, M.J., White, N.M.: Energy harvesting vibration sources for
microsystems applications. Meas. Sci. Technol. 17(12), R175–R195 (2006)
4. Bohn, J., Coroama, V., Langheinrich, M., Mattern, F., Rohs, M.: Living in a world
of smart everyday objects: social, economic, and ethical implications. J. Hum. Ecol.
Risk Assess. 10, 763–786 (2004)
5. CASAGRAS: RFID and the inclusive model for the internet of things. Final report
for EU Framework 7 Project No. 216803 (2009)
6. Chirap, A., Popa, V., Coca, E., Potorac, D.A.: A study on light energy harvesting
from indoor environment: the autonomous sensor nodes. In: Proceedings of IEEE
International Conference on Development and Application Systems (DAS), pp.
127–131. IEEE Press, New York (2014)
7. Chong, C.Y., Kumar, S.P.: Sensor networks: evolution, opportunities, and challenges. Proc. IEEE 91, 1247–1256 (2003)
8. Coren, M.J.: Tesla has 780 million miles of driving data, and adds another million every 10 hours. http://qz.com/694520/tesla-has-780-million-miles-of-driving-data-and-adds-another-million-every-10-hours/
9. Dewan, A., Ay, S.U., Karim, M.N., Beyenal, H.: Alternative power sources for
remote sensors: a review. J. Power Sources 245, 129–143 (2014)
10. Di Pietro, R., Guarino, S., Verde, N.V., Domingo-Ferrer, J.: Security in wireless
ad-hoc networks: a survey. Comput. Commun. 51, 1–20 (2014)
11. Dover, C.: Worldwide SaaS Enterprise Applications 2014–2018 Forecast and 2013
Vendor Shares. IDC Research, Framingham (2014)
12. Filipova-Neumann, L., Welzel, P.: Reducing asymmetric information in insurance
markets: cars with black boxes. Telematics Inform. 27, 394–403 (2005)
13. Gorlatova, M., Sarik, J., Grebla, G., Cong, M., Kymissis, I., Zussman, G.: Movers
and shakers: kinetic energy harvesting for the internet of things. IEEE J. Sel. Areas
Commun. 33, 1624–1639 (2015)
14. Hsiao, C.C., Jhang, J.W., Siao, A.S.: Study on pyroelectric harvesters integrating
solar radiation with wind power. Energies 8, 7465–7477 (2015)
15. Hoang, D.C., Tan, Y.K., Chng, H.B., Panda, S.K.: Thermal energy harvesting from
human warmth for wireless body area network in medical healthcare system. In:
Proceedings of International Conference on Power Electronics and Drive Systems
(PEDS), pp. 1277–1282. IEEE Press, New York (2009)
16. Jansen, B.J.: Understanding User-web Interactions Via Web Analytics. Morgan &
Claypool, London (2009)
17. Jayachandran, S., Sharma, S., Kaufman, P., Raman, P.: The role of relational
information processes and technology use in customer relationship management.
J. Mark. 69, 177–192 (2005)
18. Kamalinejad, P., Mahapatra, C., Sheng, Z., Mirabbasi, S., Leung, V.C.,
Liang, G.Y.: Wireless energy harvesting for the internet of things. IEEE Commun. Mag. 53, 102–108 (2015)
19. Khaligh, A., Zeng, P., Zheng, C.: Kinetic energy harvesting using piezoelectric
and electromagnetic technologies: state of the art. IEEE Trans. Ind. Electron. 57,
850–860 (2010)
20. Klat, W., Decker, R., Stummer, C.: Marketing management in the era of the internet of things. Working paper, Faculty of Business Administration and Economics,
Bielefeld University (2016)
21. Konana, P., Ray, G.: Physical product reengineering with embedded information
technology. Commun. ACM 50, 72–78 (2007)
22. Körling, M.: Smart products: why adding a digital side to a toothbrush could make
a lot of sense. Ericsson Bus. Rev. 18, 26–31 (2012)
23. Kryvinska, N., Strauss, C.: Conceptual model of business service availability vs.
interoperability on collaborative IoT-enabled eBusiness platforms. In: Bessis, N.,
Xhafa, F., Varvarigou, D., Hill, R., Li, M. (eds.) Internet of Things and Inter-cooperative Computational Technologies for Collective Intelligence. SCI, vol. 460, pp. 167–187. Springer, Heidelberg (2013). doi:10.1007/978-3-642-34952-2_7
24. Lu, X., Wang, P., Niyato, D., Kim, D.I., Han, Z.: Wireless networks with RF
energy harvesting: a contemporary survey. IEEE Commun. Surv. Tutor. 17, 757–
789 (2015)
25. Lu, X., Yang, S.H.: Thermal energy harvesting for WSNs. In: Proceedings of IEEE
International Conference on Systems Man and Cybernetics (SMC), pp. 3045–3052.
IEEE Press, New York (2010)
26. Ma, C., Wang, J.: Enterprise information management system integration based
on internet of things technology. Manag. Eng. 22, 12–15 (2016)
27. Mayer, P.: Economic aspects of smartproducts. Whitepaper, Institute of Technology Management at the University of St. Gallen (2010)
28. Meyer, G.G., Buijs, P., Szirbik, N.B., Wortmann, J.C.: Intelligent products for
enhancing the utilization of tracking technology in transportation. Int. J. Oper.
Prod. Manag. 34, 422–446 (2014)
29. Mishra, D., De, S., Jana, S., Basagni, S., Chowdhury, K., Heinzelman, W.: Smart
RF energy harvesting communications: challenges and opportunities. IEEE Commun. Mag. 53, 70–78 (2015)
30. Mitcheson, P.D., Yeatman, E.M., Rao, G.K., Holmes, A.S., Green, T.C.: Energy
harvesting from human and machine motion for wireless electronic devices. Proc.
IEEE 96, 1457–1486 (2008)
31. Norton, D.W., Pine, B.J.: Using the customer journey to road test and refine the
business model. Strat. Leadersh. 41, 12–17 (2013)
32. Olson, D.L., Kesharwani, S.: Enterprise Information Systems: Contemporary
Trends and Issues. World Scientific, Singapore (2010)
33. Pang, Z., Zheng, L., Tian, J., Kao-Walter, S., Dubrova, E., Chen, Q.: Design of
a terminal solution for integration of in-home health care devices and services
towards the internet-of-things. Enterp. Inf. Syst. 9, 86–116 (2015)
34. Paradiso, J.A., Feldmeier, M.: A compact, wireless, self-powered pushbutton controller. In: Abowd, G.D., Brumitt, B., Shafer, S. (eds.) UbiComp 2001. LNCS, vol.
2201, pp. 299–304. Springer, Heidelberg (2001). doi:10.1007/3-540-45427-6_25
35. Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web usage
mining as a tool for personalization: a survey. User Model. User-Adap. Inter. 13,
311–372 (2003)
36. Piñuela, M., Mitcheson, P.D., Lucyszyn, S.: Ambient RF energy harvesting in
urban and semi-urban environments. IEEE Trans. Microw. Theory Tech. 61, 2715–
2726 (2013)
37. Porter, M.E., Heppelmann, J.E.: How smart, connected products are transforming
competition. Harvard Bus. Rev. 92, 64–88 (2014)
38. Raghunathan, V., Kansal, A., Hsu, J., Friedman, J., Srivastava, M.: Design considerations for solar energy harvesting wireless embedded systems. In: 4th International Symposium on Information Processing in Sensor Networks (IPSN), pp.
457–462. IEEE Press, New York (2005)
39. Rigby, D.K., Reichheld, F.F., Schefter, P.: Avoid the four perils of CRM. Harvard
Bus. Rev. 80, 101–109 (2002)
40. Romero, D., Vernadat, F.: Enterprise information systems state of the art: past,
present and future trends. Comput. Ind. 79, 3–13 (2016)
41. Shebli, F., Dayoub, I., M’foubat, A.O., Rivenq, A., Rouvaen, J.M: Minimizing
energy consumption within wireless sensor networks using optimal transmission
range between nodes. In: Proceedings of IEEE International Conference on Signal
Processing and Communications (ICSPC), pp. 105–108. IEEE Press, New York (2007)
42. Shenck, N.S., Paradiso, J.A.: Energy scavenging with shoe-mounted piezoelectrics.
IEEE Micro 21, 30–42 (2001)
43. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: discovery
and applications of usage patterns from web data. SIGKDD Explor. 1, 12–23 (2000)
44. Strazdins, G., Wang, H.: Open security and privacy challenges for the internet of
things. In: Proceedings of 10th International Conference on Information, Communications and Signal Processing (ICICS). IEEE Press, New York (2015)
45. Sudevalayam, S., Kulkarni, P.: Energy harvesting sensor nodes: survey and implications. IEEE Commun. Surv. Tutor. 13, 443–461 (2011)
46. Tiwana, A., Konsynski, B., Bush, A.A.: Platform evolution: coevolution of platform
architecture, governance, and environmental dynamics. Inf. Syst. Res. 21, 675–687 (2010)
47. Vullers, R.J.M., Schaijk, R.V., Visser, H.J., Penders, J., Hoof, C.V.: Energy harvesting for autonomous wireless sensor networks. IEEE Solid-State Circ. Mag. 2,
29–38 (2010)
48. Weber, M., Denk, M., Oberecker, K., Strauss, C., Stummer, C.: Panel surveys go
mobile. Int. J. Mob. Commun. 6, 88–107 (2008)
49. Weber, R.H.: Internet of things: new security and privacy challenges. Comput. Law
Secur. Rev. 26, 23–30 (2010)
50. Wei, Z.: Framework model on enterprise information system based on internet of
things. Int. J. Intell. Inf. Syst. 3, 55–59 (2014)
51. Weiser, M.: The computer for the 21st century. Sci. Am. 265, 94–104 (1991)
52. Whitmore, A., Agrawal, A., Xu, L.D.: The internet of things: a survey of topics
and trends. Inf. Syst. Front. 17, 261–274 (2015)
53. Xu, L.D., He, W., Li, S.: Internet of things in industries: a survey. IEEE Trans.
Ind. Inform. 10, 2233–2243 (2014)
54. Yang, F., Wang, K.C., Huang, Y.: Energy-neutral communication protocol for
very low power microbial fuel cell based wireless sensor network. IEEE Sens. J. 15,
2306–2315 (2015)
Enterprise Information Systems
and Technologies in Czech Companies
from the Perspective of Trends in Industry 4.0
Josef Basl
University of Economics, Prague, Czech Republic
[email protected]
Abstract. The paper deals with aspects of ICT innovation based on the development of the internet of things in industrial companies. The article presents the main results of a pilot survey carried out in a number of Czech companies. The results show the current understanding of Industry 4.0 principles and the penetration of these trends in companies, including the penetration level of the main IT trends and the integrating role of enterprise information system applications in Industry 4.0.
Keywords: Enterprise information system · Internet of things · Industry 4.0 · Innovation · Information and communication technology · Cyber-physical
1 Introduction
Long-term forecasts and trends of global development show that information and
communication technology will continue to play a leading role among innovation
technologies. Trends such as big data and cloud computing are very important today,
but it seems that they will remain very important over the next 10-15 years. For
example, the document Global trends 2030 [7] emphasizes ICT as one of four key
technological areas:
• Information and communication technology
• Technologies pertaining to the security of vital resources (food, water, and energy)
• New health technologies
• New manufacturing and automation technologies.
Manufacturing and automation technologies are crucial for the deployment of ICT
and at the same time they also represent one of the key segments of the portfolio of the
Czech economy with a strong influence on the Czech labour market [2]. The importance of ICT in the future is also emphasized in the survey done by the OECD [14].
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 156–165, 2016.
DOI: 10.1007/978-3-319-49944-4_12
2 Theoretical Background
There is no doubt that ICT plays a key role in the development of society as a whole. ICT will be shaping global economic, social, and military developments, as well as the world community's actions pertaining to the environment, by 2030. Information technology will provide global access and pervasive services; social media and cybersecurity will be large new markets.
trends according to the Gartner Group are [6]:
• The digital mesh
– The device mesh
– Ambient user experience
– 3D printing machines
• Smart machines
– Information of everything
– Advanced machine learning
– Autonomous agents and things
• The new IT reality
– Adaptive security architecture
– Mesh app and service architecture
– IoT architecture and platforms.
Advanced system architectures and virtual reality are symptoms of an important new wave of changes. The time when it was trendy to talk about Web 2.0 or Enterprise 2.0 is partly behind us. At that time, Enterprise 2.0 referred to the use of emergent social software platforms within companies, or between companies and their partners or customers. Tools and services employing Web 2.0 techniques such as tagging, ratings, networking, user commenting and discussion, and open creation and editing policies came into use at that time [17].
But the current ICT trends do not only emphasize the social networking role of the
internet. The internet is now also a platform for communication among machines and
products. The internet offers a complete solution that goes beyond the potential and
possibilities of traditional manufacturing industries.
All these changes are very significant and this new wave of changes is called the
new, fourth industrial revolution. This revolution has started changes and movements
that have never been experienced in society before.
To better understand the term ‘fourth industrial revolution’, the main principles of
the previous three industrial revolutions should be remembered. The first industrial
revolution was based on steam power and mechanization in industry. The second
revolution was caused by electricity and mass production and connected with ‘hard
automation’. The third industrial revolution was based on computers and it was connected with ‘flexible automation’. Finally, the current fourth revolution is also based on
ICT but is associated with ‘cyber physical systems’.
Fields such as automation, robots or digitalization of everything are important, and
again the internet plays a key role – in the form of the internet of things (IoT) or rather
the internet of everything [5, 10]. This is the reason why we sometimes call the changes
of Industry 4.0, ‘factory 4.0’ [9] or ‘smart factory’ [19]. This designation is a continuation of the term ‘digital factory’ which has been used in previous years.
The basic principles of Industry 4.0 are therefore the connection of machines, work
pieces and systems, and businesses are creating intelligent networks along the entire
value chain that can control each other autonomously.
Industry 4.0 is a way to improve production processes, to increase productivity for a batch size of 1, and to respond to individual demands and short-term wishes. It helps to reduce lead time and time to market, to shorten product development through ad-hoc networking within cyber-physical systems, to create transparency in real time, to enable faster and more flexible decision making, and to achieve global optimization in development and production.
Examples of Industry 4.0 include machines that can predict failures and trigger maintenance processes autonomously, or self-organized logistics that react to unexpected changes in production. Cyber-physical systems (CPS) are integrations of computation and physical processes [12, 20].
Industry 4.0 means making important efforts not only at a technological but also at
a national level. A good example has been set by the German government. The German
Federal Ministry for Education and Research currently offers 183 different documents
related to this topic. For example, there is a project of the future, ‘Assembly 4.0’ which
was presented with the project of the month award in 2016. Industry 4.0 was also
proposed and adopted as a part of the ‘High-Tech Strategy 2020 Action Plan’ of the
German government (Recommendations, 2013). The general expectation is that Industry 4.0 in Germany will grow by about 1.7% each year until 2020 – mainly in chemistry, manufacturing, ICT and farming.
Similar steps have been taken in other industrially developed countries like the USA (in the 'Industrial Internet' document [9]) and China (in the 'Internet+' document [19] and in the ambitious plan 'Made in China 2025' [11]). The Chinese government declares there that the country is aiming at Industry 4.0 implementation.
It is very important to note that the Czech government also strongly supports the
Industry 4.0 trends in the document ‘The national strategy Industry 4.0’ published in
September 2015 [13]. It was prepared and guaranteed by the Czech Ministry for
Industry and Trade. Not only technological trends are elaborated here, but the changes
in the labor market are highlighted as well.
3 Methodology – Formulation of the Aim and Research
This paper deals with a survey of the penetration of Industry 4.0 principles in Czech
companies. The important questions concern the role of the selected IT trends and
enterprise information system software applications within the Industry 4.0 framework
now and in the near future (between 2 and 5 years). The other questions ask about the
preparation of Czech companies for this new trend.
The motivation for this survey was not only the current technological trends but the
published manufacturing studies oriented towards Industry 4.0 penetration – on the
global level [8] and on the national level in Germany [1, 3, 15].
The first of these surveys was the most significant. It was done by Infosys – a leader
in consulting, technology, outsourcing and next-generation services and by the Institute
for Industrial Management at the University of Aachen in Germany. The survey analyzed more than 400 companies in industrially highly developed countries – China,
France, Germany, the United Kingdom and the United States. It shows the level of
maturity of Industry 4.0 and the key findings of this study are as follows [8]:
• 85% of manufacturing companies globally are aware of the potential of technologies for increasing asset efficiency
• However, only 15% of enterprises surveyed have so far implemented dedicated
strategies to this end by analyzing data from their machines
• The research revealed that the largest improvements planned over the next five
years are in the areas of information interoperability, data standardization and
advanced analytics
• It is interesting that one fifth of companies do not believe that they will achieve anything beyond recognizing the potential of the Industry 4.0 concept by 2020.
The results of the survey declare that, of all five analyzed countries (China, France, Germany, the UK and the US), China is the leading innovator and has the highest percentage of early adopters (57%). Germany is in fourth place with only 21%
of early adopters. The German attitude and its wide support of Industry 4.0 is a big
inspiration for the Czech economy and companies. There are many German investors
and owners of companies in the Czech Republic and there is also close business
cooperation between both countries, with a large volume of mutual exports.
Another reason for the survey described in this paper is an effort to obtain a more detailed view of current trends connected with ICT, such as mobile devices, cloud computing and big data on the one hand, and ERP, MES, BI and APS
applications on the other. Last but not least, topics such as robots, smart logistics and
flexible production planning were also observed.
The further research question in this survey examines whether the strategies for
Industry 4.0 are being implemented at the proper level in Czech companies, in a way
comparable with the situation in Germany (as there is a high level of integration of
companies from both countries).
The main research questions in this survey are as follows:
(1) Are the main ICT trends (like cloud computing, big data and the internet of things) being applied in the current development of companies, and is growth of their penetration expected over the next 5 years?
(2) Do ERP systems play the main role as the integrating application software?
(3) Are the main trends of Industry 4.0 (such as robots or adaptive maintenance) being applied in the current development of companies, and is growth of their penetration expected over the next 5 years?
(4) Do Czech companies already have a strategy for Industry 4.0?
4 Sample Description and Data Collection
To be able to answer the research questions, a special questionnaire form was created
which was made available for the companies on the website. Data collection was
carried out by completing the web form in June/July 2016.
A set of 169 companies was addressed by the survey. 24 companies answered, giving the survey a 14.2% response rate. It is important that the sample of companies reflects well the profile of the whole Czech economy, as the majority of participating firms belong to the automotive industry (29%) and mechanical engineering (25%). The Czech Republic is, incidentally, the most industrialized EU country, with industry accounting for 47.3% of GDP (compared with only 40.2% in Germany).
The companies that participated in the survey were mostly large companies with more
than 250 employees (66.7%) and middle sized companies (25%). There was also balanced
ownership of domestic investors (58.3%) and foreign investors and owners (41.7%). Important for the validity of the data was the fact that it was mostly directors or company owners (33.3%) and top managers (41.7%) who answered the questions in the form. It is interesting that only 4.2% of respondents were IT managers.
There is one more aspect that characterizes the survey. The majority of companies
that answered and participated in the survey declared that they have dealt with Industry
4.0 for either more than one year (41.7%) or are dealing with it right now (20.8%).
Only one fifth of the companies (20.8%) say that they know about this new trend but
they do not want to implement it. On the other hand, only 8.3% of companies declared
that they have not heard of Industry 4.0 so far.
Finally, there is one more interesting fact revealed by the survey. Respondents spent roughly one hour filling in the questionnaire, which confirms the perceived importance of Industry 4.0. The main results of the survey and the answers are
described below.
5 Research Results
The results from the survey as a whole reflect the trends in the application of ICT (such
as cloud computing and big data), IT supported technological trends (e.g. robots,
predictive maintenance, digital manufacturing and smart logistics) and usability and
integration of the traditional enterprise information systems packages (like ERP, MES
and CRM for example) in the Industry 4.0 trends and principles.
ICT Trends Penetration
The first research question: Are the main IT trends (like cloud computing, big data, and the internet of things) being applied in the current development of companies, and is growth of their penetration expected over the next 5 years?
The results show that all the selected topics are already being applied today: cloud computing is in first place, big data second, and the industrial internet of things third.
It is interesting that demand for and implementation of these three trends are not predicted to grow over the next five years, even though their penetration is a crucial foundation for Industry 4.0.
[Table: Penetration of IT trends. Rows: industrial internet of things, cloud computing, big data, mobile devices, BYOD concept (bring your own device), Google glasses, smart watches, open software, open protocols, open data models. Columns: planned to be used in the following 2 years, planned to be used in the following 5 years. Reported cell counts range from 1 (4.3%) to 5 (21.7%); their assignment to individual rows is not recoverable from the source layout.]
The industrial internet of things is one of the key aspects of the whole new wave of changes known as the fourth industrial revolution. It is therefore interesting that nearly one half of the enterprises (47.8%) want more information about it. The other two main trends (cloud computing and big data) are also topics that need further explanation for nearly one third of the enterprises.
IT trends – share of enterprises answering "We would like to get more information":
• Industrial internet of things: 11 (47.8%)
• Cloud computing: 6 (26.1%)
• Big data: 7 (30.4%)
• Mobile devices: 8 (34.8%)
• BYOD concept (bring your own device): 11 (47.8%)
• Google glasses: 13 (56.5%)
• Smart watches: 13 (56.5%)
• Open software architecture: 11 (47.8%)
• Open protocols: 13 (56.5%)
• Open data models: 14 (60.9%)
New trends like Google glass (56.5%), smart watches (56.5%) and open protocols
(56.5%) are the topics which companies would most like to have further explained.
Enterprise Information Systems Packages Integration in the Industry
4.0 Perspective
The second research question: Do ERP systems play the main role as integrating application software?
The results confirm the key role of ERP systems (65.2%) in the integration plans of companies preparing for Industry 4.0. The next most important package is the MES application.
Enterprise information system application
ERP (Enterprise Resource Planning)
MES (Manufacturing Execution System)
APS (Advanced Planning and Scheduling)
PLM (Product Lifecycle Management)
WMS (Warehouse Management System)
BI (Business Intelligence)
BPM/BPMS (Business Process Management Suites)
Other SW applications:
Other SW applications mentioned included: CRM, Tecnomatix, NX, voice technology and paperless documentation.
ICT Based Technologies Penetration in the Future of Industry 4.0
The third research question: Are the main technologies of Industry 4.0 (robots, adaptive automation) being applied in the current development of companies, and is growth in their penetration expected over the next five years?
The three most important IT-based technologies are robots (39.1%), digital factory (30.4%) and predictive maintenance (30.4%). Higher demand over the following two years is expected for predictive maintenance (34.8%) and smart logistics (26.1%). 3D printing is used extensively now, and there are high expectations that it will still be in use in five years' time.
(Table: current and planned penetration of the surveyed Industry 4.0 technologies — digital factory, adaptive automation, robots, industrial 3D printing, smart logistics, predictive maintenance and cybernetic data security — with columns for current use and planned use in the following two and five years; the individual cell values are not recoverable.)
The main topics for further education and additional awareness are cybernetic data security (39.1%), followed by adaptive automation (34.8%) and by digital factory, predictive maintenance and industrial 3D printing (30.4% each).
IT based technologies        We would like to get more information
Digital factory              7 (30.4%)
Adaptive automation          8 (34.8%)
Robots                       5 (21.7%)
Industrial 3D printing       7 (30.4%)
Smart logistics              6 (26.1%)
Predictive maintenance       7 (30.4%)
Cybernetic data security     9 (39.1%)
Existence of an Industry 4.0 Strategy
The fourth research question: Do Czech companies have a strategy for Industry 4.0?
A high percentage of Czech enterprises (39.1%) do not currently have a strategy for Industry 4.0. Nearly the same percentage (30.4%) is preparing such a strategy. Finally, nearly 25%, that is every fourth company, already has a strategy for Industry 4.0. This is very similar to the answers from firms in the global survey (Infosys, 2015).
Strategy for Industry 4.0:
• We do not have a strategy for Industry 4.0
• We do not have a strategy for Industry 4.0 now, but we are preparing it
• We have a strategy for Industry 4.0 and it is part of the business strategy
• We have a strategy for Industry 4.0 but it is not part of the business strategy
6 Conclusion
Industry 4.0 seems to be a topic with high potential, especially at a time when the digitalization of production, and of life in general, is increasing. The survey indicates many similarities between Industry 4.0 penetration in Czech companies and in leading developed countries.
The survey identified considerable potential for further analyses, as well as the obstacles preventing the wider application of Industry 4.0. The main reason given by companies is low awareness of the issues of Industry 4.0 (declared by 73.3%), followed by unclear effects on business (40%). The high costs connected with implementing Industry 4.0 (40%) are also cited as an obstacle.
The survey shows that companies perceive the level of penetration to be very low, and they also feel that there are as yet no proper methodologies or road maps for implementing Industry 4.0. These aspects are again possible areas for further research.
Last but not least, Industry 4.0 has a crucial influence on the labour market. It is expected that many jobs will be lost because of Industry 4.0. This is all the more important for the Czech Republic, which experienced the biggest growth of jobs after the crisis, especially in manufacturing (OECD, 2015). It will therefore be necessary to keep a balance within Industry 4.0 developments from both the technological and the social perspective.

References
1. Computer Sciences Corp: CSC - Studie Industrie 4.0: Ländervergleich Dach (2015). http://
assets1.csc.com/de/downloads/Ergebnisse_CSC-Studie_4.0.pdf. Accessed 27 Apr 2016
2. Doucek, P.: Human capital in ICT – competitiveness and innovation potential in ICT. In:
IDIMT-2011, Jindřichův Hradec, 7.09.2011 – 9.09.2011, pp. 11–23. Trauner Verlag
Universitat, Linz (2012). ISBN 978-3-85499-973-0
3. Eisert, R.: Sind Mittelständler auf Industrie 4.0 vorbereitet? (2014b). http://www.wiwo.de/
unternehmen/mittelstand/innovation-readiness-index-sind-mittelstaendler-auf-industrie-4-0vorbereitet/10853686.html. Accessed 27 Apr 2016
4. Factory 4.0? Lab where you will sit in front of PC http://zpravy.aktualne.cz/ekonomika/
r*4e1ca8206cf611e58f1e002590604f2e/. Accessed 27 Apr 2016
5. Tao, F., Zuo, Y., Xu, L.D., Zhang, L.: IoT-based intelligent perception and access of
manufacturing resource toward cloud manufacturing. IEEE Trans. Ind. Inform. 10(2), 1547–
1557 (2014)
6. Gartner: Top 10 Strategic Technology Trends for 2016. http://www.gartner.com/
technology/research/top-10-technology-trends/. Accessed 27 Apr 2016
7. Global Trends 2030: Alternative Worlds, National Intelligence Council (2012). https://
Accessed 27 Apr 2016
8. Industry 4.0 - The State of the Nations, INFOSYS. http://images.experienceinfosys.com/
State_of_the_Nations_2015_-_Research_Report.pdf. Accessed 27 Apr 2016
9. The Industrial Internet Consortium: A Global Nonprofit Partnership of Industry, Government
and Academia, 2014, http://www.iiconsortium.org/about-us.htm. Accessed 27 Apr 2016
10. Jing, Q., Vasilakos, A.V., Wan, J., Lu, J., Qiu, D.: Security of the internet of things:
perspectives and challenges. Wirel. Netw. 20(8), 2481–2501 (2014)
11. Kennedy, S.: Made in China 2025, Center for Strategic and International Studies (2015).
http://csis.org/publication/made-china-2025. Accessed 27 Apr 2016
12. Lee, E.A.: Cyber physical systems: design challenges. Technical report
No. UCB/EECS-2008-8, http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-8.
html. Accessed 27 Apr 2016
13. National Initiative – Industry 4.0, Ministry for Industry and Trade, September 2015. http://
www.spcr.cz/images/priloha001-2.pdf. Accessed 27 Apr 2016
14. OECD Science, Technology and Industry Scoreboard 2015, Innovation for Growth and Society
(2016). http://www.oecd.org/science/oecd-science-technology-and-industry-scoreboard-20725345.htm. Accessed 27 Apr 2016
15. Perspektive Mittelstand: Industrie 4.0 macht Mittelstand zu schaffen (2015). http://www.
6093.html. Accessed 27 Apr 2016
16. Premier of the State Council of China and Li, K.Q.: Report on the work of the government.
In: Proceedings of the 3rd Session of the 12th National People’s Congress, March 2015.
Accessed 27 Apr 2016
17. Research and Markets: Enterprise 2.0: Is It Time for Your Organization to Make the Transition? (2008). http://search.proquest.com/docview/446162456. Accessed 27 Apr 2016
18. Soliman, F., Youssef, M.A.: Internet-based e-commerce and its impact on manufacturing
and business operations. Ind. Manag. Data Syst. 103(8–9), 546–552 (2003)
19. Wang S., Wan J., Li D., Zhang C.: Implementing smart factory of industrie 4.0: an outlook.
Int. J. Distrib. Sens. Netw. 2016, Article ID 3159805, pp. 1–10 (2016). http://dx.doi.org/10.
20. Wan, J., Yan, H., Liu, Q., Zhou, K., Lu, R., Li, D.: Enabling cyber-physical systems with
machine-to-machine technologies. Int. J. Ad Hoc Ubiquit. Comput. 13(3–4), 187–196
21. Xu, X.: From cloud computing to cloud manufacturing. Robot. Comput.-Integr. Manuf. 28
(1), 75–86 (2012)
Internet of Things Integration in Supply
Chains – An Austrian Business Case
of a Collaborative Closed-Loop
Andreas Mladenow1, Niina Maarit Novak2, and Christine Strauss1
1 Department of E-Business, Faculty of Business, Economics and Statistics,
University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
2 Institute of Software Technology and Interactive Systems, Vienna University
of Technology, Favoritenstr. 9-11, 1040 Vienna, Austria
[email protected]
Abstract. Although Internet of Things (IoT) applications have existed for
several years now, only a very small number of these applications are
available on the market. Thus, there is still not much guidance on how to
describe the evolution of cross-organizational IoT integration in a systematic
manner. In this regard, we intend to make a contribution by analyzing a business
case, i.e. an ongoing IoT implementation of an Austrian retail trader. Hence, we
elaborate on the type of integration for a cross-organizational integration of IoT,
highlight the effects on companies and analyze the economic feasibility.
Keywords: IoT · Closed-loop collaboration · Supply chain · RFID
1 Introduction
Cisco, a US-based technology and software company, recently estimated that approximately 14 trillion dollars can be earned with the Internet of Things (IoT) before 2022 [1]. Mobility [2, 3], globalization and a society with very diverse consumption and leisure behaviour result in increasingly complex supply chains with a vast variety of customized goods, partially manufactured at multiple locations [4] and marketed worldwide [5]. This imposes particularly high demands on logistics providers [6], which act as connecting elements in our increasingly integrated world [7]. Systems
ought to be dynamic and flexible [8], but at the same time robust and cost-effective.
The IoT provides a completely novel control architecture, which meets the described
demand for highly flexible intralogistics systems and has thus the potential to shape the
future of logistics in general [9].
With IoT-applications and an increasing degree of self-organized supply chains, it becomes possible for autonomously interacting objects and processes to increasingly connect with the digital world of the Internet [10]. The IoT is neither a closed nor an independent system; it includes a variety of different technologies, including radio networks,
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 166–176, 2016.
DOI: 10.1007/978-3-319-49944-4_13
sensors and networks, microprocessors, tracking and tracing (T&T) functions, agent-based systems, smart labels, quick response (QR) codes and radio-frequency identification (RFID) [11]. As far as logistics is concerned, these technologies mark a
paradigm shift [12]. Against this background, many years of intensive research and development are being invested to implement suitable IoT-applications for the logistics sector. As of today, the IoT is used partially as a technology and already offers significant benefits for users. T&T systems offer the recipients of parcels the service of following the progress and current status of a shipment while it is transported. This service is made possible via unique barcodes or 2D-codes, which are scanned at specific transfer stations to uniquely identify a shipment and to forward the current status automatically to a central data system. The current status can then be checked by the recipient of a parcel via a web interface.
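The T&T flow just described (scan at a transfer station, forward to a central data system, status lookup by the recipient) can be sketched as follows; all names and the in-memory store are illustrative assumptions, not an actual carrier system:

```python
# Minimal sketch of the track-and-trace flow: scan events at transfer stations
# are appended to a central store, keyed by the shipment's unique code.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Shipment:
    code: str                        # unique barcode / 2D-code
    events: list = field(default_factory=list)

central_store: dict[str, Shipment] = {}

def scan(code: str, station: str, status: str) -> None:
    """Called at each transfer station; forwards the status to the central system."""
    shipment = central_store.setdefault(code, Shipment(code))
    shipment.events.append((datetime.now(), station, status))

def current_status(code: str):
    """What the recipient sees via the web interface: the latest event."""
    return central_store[code].events[-1]

scan("PKG-0001", "Vienna hub", "arrived")
scan("PKG-0001", "Linz depot", "out for delivery")
print(current_status("PKG-0001")[1:])   # ('Linz depot', 'out for delivery')
```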
In the case of business collaborations [13], non-transparent and complex processes along the supply chain sometimes make it difficult to evaluate the economic gains of individual business partners [14, 15]. Often the interests of the partners differ greatly [16].
What are the potentials of IoT-based applications? Which holistic assessment
methods are used along the supply chain to evaluate its economic efficiency? The
corresponding evaluation deals with the performance-based distribution of costs and
benefits along the supply chain, and with the necessary determination of target values
for efficiency and resource allocation. Hence, the paper is structured as follows: the
next section elaborates on the type of integration for a cross-organizational integration
of IoT, highlights the effects on companies and deals with the economic feasibility.
Section 3 analyzes an ongoing integration case, based on Siggelkow [17]. Afterwards, Sect. 4 discusses relevant aspects. Section 5 summarizes the major findings of our ongoing research and gives a short outlook on future areas of research and likely developments.
2 Background
The demand to introduce cross-organizational IoT-applications occurs foremost towards the end of the supply chain, at the retailer [18]. From an economic point of view, a decision in favor of IoT-applications leads at the same time to disadvantages resulting from a loss of subsidization for other important and strategic business projects related to achieving superordinate goals, such as improving customer service satisfaction, innovation leadership [19] or company reputation in general [20].
Type of Integration
The collaborative closed-loop IoT-application allows for a cross-organizational implementation of systems and software along a supply chain, enabling access to all relevant data in order to ensure smooth operation and control of processes. These inherent features make it obvious that the use of collaborative closed-loops is particularly suitable for delivery networks, which are characterized by an intensive flow of materials.
It allows for particularly reliable, automated and real-time-based control of the material flow. Moreover, the IoT-application RFID is used at this level, e.g. in the area of container management. These two applications are examples of the typical application fields along a value-added network, namely trading networks with an exceptionally wide range of products, or production networks with a particularly high number of individual components.
The open-loop system represents the most intensive interrelation across company boundaries. In addition to the integration of several companies, the end user is included as well. For the end user, the integration into the collaborative network is particularly beneficial, as it allows him to verify the origin and authenticity of a product. This is especially important in economic sectors (e.g. the pharmaceutical industry) with a fundamental and growing interest in product control and traceability. The open circuit is not only beneficial to the consumer, but also has positive effects on the entire supply network, because the technology significantly reduces the workload, for example in the case of complaints or product recalls.
External and Internal Effects on Companies
In relation to cross-organizational implementation and use, external and internal effects on the company have to be differentiated. While in the case of internal effects the potential benefits have a direct influence on the company's processes, no such direct relationship can be observed in the case of external effects. Through the automation effect, a rationalization of manual work occurs in connection with data collection. The associated information effect guarantees that the company receives reliable information about the current position and state of items throughout the network at all times. As a consequence of the information effect, a further effect occurs: the transformation effect. This effect allows processes to be rearranged in a more effective manner and leads to further reduced inventory costs. Improved transparency enables reductions in costs and inventory (especially safety stock, as stock levels and item locations, as well as expected delivery times, can be monitored along the entire value-added chain without reducing the delivery service level).
Potential benefits can be triggered by both internal and external effects. In the case of external effects there are two main benefits which positively influence project profitability, namely increased efficiency and savings [21] (in terms of fines resulting from various contract infringements). A substantial reason for this is the technically supervised commissioning of goods, which enables error-free delivery. This has immediate positive effects on the number of incorrect deliveries, and thus on the number of fines for wrong or delayed deliveries. Another indirect cost reduction concerns recourse payments, which can be prevented through the complete documentation of manufacturing processes and the possibility to verify which materials and components have been used. This is particularly important for industries using voluntary or obligatory safety and authenticity certificates, e.g. the pharmaceutical industry, as well as other industries where safety regulations are of particular importance. At the same time, product counterfeiting can be detected more easily and faster, which in turn protects the product image. Moreover, administrative expenses in connection with product recalls can be either completely eliminated or at least reduced.
Another external effect is the increase of efficiency and revenues caused by direct and indirect effects. Especially the shortening of processing times leads to increased process efficiency. The avoidance and/or decrease of incorrect deliveries further affects the reputation of the enterprise in a positive way, and thus contributes to an increase in customer orders.
The internal effects can be divided into three categories: (i) effects on upstream processes, (ii) effects on downstream processes, and (iii) effects on processes in other enterprises of the network. All three internal effects have in common that their potential is realized when the process components are captured. At this point a superordinate potential is realized, as the collection and simultaneous generation of events is from then on done fully automatically rather than by hand.
The first type of internal effect is characterized by a positive influence on upstream processes in a certain area: for example, when a component is taken from stock during the assembly process of a product, a production order is automatically triggered and forwarded to ensure that a new component is placed into stock in time. In the case of upstream processes this results in less planning effort as well as in a higher degree of machine utilization. The second internal effect affects all downstream processes. Thus, for example, with the acceptance of a new order, all downstream processes are "informed" about the start of the manufacturing process of the new product, simply through the generation of a new event in the IoT-application. Following this, automatic actions are taken at the downstream process locations, which increases process efficiency and leads to cost and time savings. This makes it possible that, at the time of accepting a new order, the necessary preparation procedures, e.g. related to the manufacturing process, can start immediately, which avoids latencies and idle times. The third type of internal effect refers to various potential benefits realizable in the sphere of other partner enterprises in the supply chain.
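The event mechanism described above (a stock withdrawal emitting an event that upstream and downstream processes react to automatically) can be illustrated with a tiny publish/subscribe sketch; the event names and handlers are our own assumptions, not the paper's implementation:

```python
# Illustrative sketch: withdrawing a component from stock emits an event, and
# subscribed upstream/downstream processes react to it automatically.
from collections import defaultdict

subscribers = defaultdict(list)          # event name -> list of handlers

def subscribe(event: str, handler) -> None:
    subscribers[event].append(handler)

def emit(event: str, **payload) -> None:
    for handler in subscribers[event]:
        handler(**payload)

log = []

# Upstream effect: trigger replenishment when a component is withdrawn.
subscribe("component_withdrawn",
          lambda part, **_: log.append(f"production order triggered for {part}"))
# Downstream effect: notify the next processing station about the same part.
subscribe("component_withdrawn",
          lambda part, station, **_: log.append(f"{station} notified about {part}"))

emit("component_withdrawn", part="gearbox-7", station="assembly-2")
print(log)
```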
All three types of internal effects thereby have a cost-minimizing influence on data management, the flow of material, and the degree of capacity utilization of the required production facilities. Savings related to data management result from automated data generation and data supply. The same is true for the flow of materials, where automated processes lead to a rationalization of manual, expensive and inefficient activities, and thus to a more efficient use of existing resources. This leads to better process planning and ultimately to a reduction of production resources. Moreover, due to more efficient use of the equipment, further costs, for example for storage or transportation vehicles, can be reduced. In contrast to external effects, the identification of benefits related to internal effects greatly depends on the chosen IoT implementation, which is why these effects cannot easily be identified on a general level.
Economic Feasibility for the Integration of Cross-Organizational IoT-Applications
Regarding economic feasibility, a clear distinction for reallocation is made between performance-based and non-performance-based strategies for cross-organizational IoT integration.
The performance-based strategy contains a value-dependent profit allocation: in this case, the generated profit of a project is shared among the participants based on the contribution (value-added share) of each contributor to the product. This strategy implies that companies that do not perform direct value-adding activities, such as logistics service providers, are neglected during the profit allocation process. This problem can only be solved via individually negotiated profit-sharing allocations. At the same time, the expenses incurred by a company cannot be directly inferred from the company's share of value added to a product. Therefore companies may suffer losses if the costs are greater than the share of value added.
Furthermore, the performance-based strategy can imply an identical return on investment. This is a redistribution strategy where each partner generates the same return on investment, meaning that the profit allocation is based on the expenses incurred. Due to this expense-oriented calculation, each company receives exactly the return on investment of the entire project. The advantage of this approach is thus a consistent return on investment throughout the entire network, which in turn increases the chances of a fast agreement between the partner companies and enables a fast implementation of the project.
On the other hand, the non-performance based strategy entails the compensation of
losses. Those companies that experience any losses due to a certain project receive
compensation in the same amount. This implies that at least one partner benefits, as the
positive net present value of the project is allocated to that partner due to the zero profit of the others, thus ensuring that all of the partners incur equal consequences.
In addition, the non-performance based strategy implies identical target allocation.
The aim of this approach is to redistribute the relative impacts of the project (i.e.
comparison of costs and benefits, which can only be positive, as the project wouldn’t
have been implemented otherwise), among the project partners, in such a way that each
company generates the same profit in that specific project. In the case of a positive network effect, where the total profit increases with the number of participants (as is the case in many supply chains), this implies a win-win situation.
Besides, there is the identical profit allocation: the profits of all partners of the project are summed up and subsequently evenly redistributed to all project partners,
regardless of other factors. This approach might thus lead to losses for some of the
project partners, provided that the received profits do not cover the expenses incurred.
Therefore, this approach entails a non-negligible probability of ending up in a win-lose
situation for some of the project-partners, which in turn may result in refusing to
participate in the project.
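The allocation strategies discussed above can be contrasted with a toy computation; the partner names and figures below are invented for illustration, and the functions only sketch the allocation rules, not a definitive model:

```python
# Hedged sketch of three redistribution strategies for a toy three-partner
# project (all numbers are invented for illustration).
expenses     = {"supplier": 40.0, "logistics": 30.0, "retailer": 50.0}
value_added  = {"supplier": 60.0, "logistics": 0.0,  "retailer": 40.0}
total_profit = 36.0

def by_value_added(profit, shares):
    """Performance-based: profit split by value-added share; note that the
    logistics provider (zero value-added) receives nothing."""
    total = sum(shares.values())
    return {p: profit * s / total for p, s in shares.items()}

def identical_roi(profit, costs):
    """Each partner earns the same return on the expenses it incurred."""
    roi = profit / sum(costs.values())
    return {p: roi * c for p, c in costs.items()}

def identical_profit(profit, partners):
    """Profits pooled and split evenly, regardless of expenses."""
    return {p: profit / len(partners) for p in partners}

print(by_value_added(total_profit, value_added))   # logistics gets 0
print(identical_roi(total_profit, expenses))       # 30% ROI for everyone
print(identical_profit(total_profit, expenses))    # 12 each
```

Note how the identical-profit split can leave a high-cost partner worse off than its expenses, which is exactly the win-lose risk described in the text.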
In order to distribute costs and benefits in a fair manner along the supply chain, a cost-benefit-sharing approach may be applied. By introducing a neutral instance, a basis for detecting and documenting company-specific and network-related costs and benefits is established. This yields proposals for a fair distribution of investments and/or savings, without disclosing company-specific key figures to partner companies.
What are the costs and benefits of individual companies using standardized process modules and impact factors such as internal transport, material processing, and storage and retrieval? Indicators worth reviewing include overall profitability, cash flows and valuation records. Indicators relevant for forecasting include structural as well as quantity-related indicators, efficiency indicators, quality metrics, and IoT impact indicators. Best practices of IoT-applications such as RFID include [20]:
• identification of relevant impact factors in terms of costs, and definition of standardized process modules for overall supply chain processes
• allocation of relevant factors to costs and benefits
• determination of target values for the assessment and evaluation of a cost model
• integration of the identified benefits for companies in the cost model
• individual evaluation of the economic efficiency of the use of IoT-applications for companies involved in the value-added process
• balancing of interest asymmetries by methods of cost-benefit-sharing and by including a neutral instance.
3 Analysis of an Austrian Retail Trader Business Case
Based on Siggelkow [17], we analyse an ongoing case of a cross-organizational, "things-orientated" [10] IoT integration. An Austrian retail trading company has implemented a collaborative closed-loop RFID system with its tier-1 supplier, as depicted in Fig. 1. RFID has been adopted for pallet-level tagging between tier-1 suppliers (make) and retail traders. Moreover, it is presently being discussed how additional IoT-applications could extend an RFID item-level tagging system to both the customer and tier-2 suppliers within the supply chain for a more efficient use of available resources and higher performance.
Fig. 1. Cross-company collaborative closed-loop implementation of an IoT-application
Furthermore, the concept of Vendor Managed Inventory (VMI) is applied in parallel to a cross-organizational use of IoT-applications by multiple, cooperating business
partners. VMI serves as an incentive instrument for the supplier.
Moreover, the tier-1 supplier is obliged to buy back from the retailer, at full price, all products that could not be sold. The supplier therefore takes the entire risk in case of overproduction. The agreement thus acts as an incentive for retailers to increase their inventory to avoid stock-outs. Higher inventory levels imply fewer stock-outs, which is ultimately beneficial for the entire supply chain and thus also for customers.
In the case of VMI the manufacturer is responsible for the inventory management of the retailer. The retailer therefore pays only for those goods which have actually been sold. The supplier is thus forced to cover any costs of making improvements in his own warehouse, not in the warehouse of the retailer. In this context it is often necessary to first acquire the necessary and appropriate competencies, infrastructure and know-how.
In addition, the customer's demand for goods depends on the price; and the retailer's demand (the retailer in turn being a customer of his supplier) depends on the demand of his customers and on the price. Hence, reducing the risk of a sub-optimal inventory level (higher or lower than the optimal level), and thus of lost profits, lost turnover or sunk costs, automatically reduces costs on the retailer's side, which in turn reduces the prices for goods. Consequently, given the consumers' price elasticity, the retailer's demand for goods increases, which in turn is beneficial for the entire supply chain.
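The buy-back rule described above (the retailer pays only for units actually sold, so the supplier carries the full cost of overproduction) can be made concrete with a toy settlement calculation; the quantities and prices are our own illustrative assumptions, not figures from the case:

```python
# Toy illustration of the buy-back agreement: unsold units are bought back by
# the supplier at full price, so only sold units generate retailer payments.
def settle(delivered: int, sold: int, wholesale: float, unit_cost: float):
    """Return (retailer_payment, supplier_profit) under the buy-back rule."""
    sold = min(sold, delivered)
    retailer_pays = sold * wholesale           # unsold units are bought back
    supplier_profit = retailer_pays - delivered * unit_cost
    return retailer_pays, supplier_profit

# Demand falls short of the delivered quantity: the supplier absorbs the loss.
print(settle(delivered=100, sold=70, wholesale=5.0, unit_cost=4.0))   # (350.0, -50.0)
# Demand matches delivery: the supplier earns the full margin.
print(settle(delivered=100, sold=100, wholesale=5.0, unit_cost=4.0))  # (500.0, 100.0)
```

The asymmetry is visible immediately: the retailer's payment never exceeds what it sold, while the supplier's profit swings with the overproduction risk.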
Such contractual agreements, however, influence not only the rights of disposal in the retailer's warehouse, but also the quality of inventory management. The application of IoT in a VMI implementation may not only generate cost advantages, but may also positively affect the relationship between the contracting parties, the efficiency of inventory management, the accuracy of the order policy, and the automatic recognition of expired products. Especially for suppliers who sell multiple products, an efficient inventory management system is of utmost importance; such a system supports the efficient use of the shelf space provided by the retailer. Furthermore, the supplier is able to improve his turnover in a specific shop or location. VMI particularly enables and suggests the display of additional products or an alternative product mix at a certain retailer.
In the present case, the use of an IoT-application seems disadvantageous for the supplier for various reasons: without a pre-defined, appropriate compensation for the supplier, the use of IoT is less profitable for the supplier than for the retailer. When receiving goods, the supplier's effort can be reduced considerably.
Even once a production process has finished, there are hardly any benefits or savings for the supplier, because many suppliers produce only a single product type or a limited number of different products. Furthermore, products are typically packed and stored in standardized dimensions, quantities and carriers. In addition, suppliers typically supply several different customers (i.e. wholesalers and retailers), of which, however, only a few use the IoT-application. Thus, even though the supplier bears a large part of the IoT-application's costs, he hardly ever has the possibility to benefit from the full potential of the technology in terms of profit. Unlike the suppliers, retailers benefit from IoT-applications from the very beginning. For example, activities associated with stock receipt can be significantly reduced during
receipt (i.e. bulk dispatch): once the goods are received, the IoT technology allows for further simplifications, for example in connection with the storing process, in the form of automated storage space allocation and its registration in the database.
The use of IoT systems further supports the order-picking and retrieval process, not only through the automatic registration of products, but also by providing a security control function [22] during the loading of ordered goods, which automatically re-checks whether the right goods are being loaded onto the right truck. This reduces the number of incorrect deliveries and commissioning errors, which ultimately allows for cost savings associated with returns of goods and redeliveries. Even before the receipt of goods, or after a successful delivery, the retailer may profit from the use of an IoT-application. Using real-time tracking and condition monitoring (e.g. temperature control for perishable goods), it is possible to detect delays or cooling-unit failures at an early stage, which allows the person in charge to react in time and to find appropriate alternative solutions [23].
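The condition-monitoring idea above can be sketched with a simple threshold check over sensor readings; the threshold and readings are hypothetical assumptions for illustration, not values from the case:

```python
# Hypothetical sketch: periodic temperature readings for a refrigerated
# shipment are checked against a threshold, so a cooling-unit failure is
# flagged before the perishable goods spoil.
SPOILAGE_THRESHOLD_C = 8.0      # assumed limit for perishable goods

def check_readings(readings):
    """Return the index of the first reading that breaches the threshold,
    or None if the cold chain stayed intact."""
    for i, temp_c in enumerate(readings):
        if temp_c > SPOILAGE_THRESHOLD_C:
            return i
    return None

cold_chain = [4.2, 4.5, 5.1, 9.3, 11.0]     # cooling unit starts failing
breach = check_readings(cold_chain)
if breach is not None:
    print(f"alert: threshold exceeded at reading {breach} ({cold_chain[breach]} C)")
```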
Moreover, the Electronic Product Code (EPC)/RFID calculator, an open-source tool (based on Visual Basic for Excel), is used to determine the economic feasibility of the cross-organizational IoT-application integration [24].
4 Discussion
IoT-implementations are increasingly applied in cross-organizational supply chains
[25]. Worldwide, millions of items are moved every day, ranging from huge containers over pallets to individual parcels. To ensure smooth processing, it is of utmost importance that items arrive at the right place at the right time [26]. Manual processes (such as counting containers before they are loaded on board an aircraft) should increasingly be replaced by self-controlled systems, and thus by smart IoT-applications which are integrated and therefore connected to each other [27]. Similar to internet data packets, shipments using IoT-applications control and navigate through the logistics networks by routing themselves on their own [28]. The
logistics service provider is subsequently informed that all shipments are being
transported [29]. From the perspective of a company this results in a considerable
increase of efficiency [30]. However, these economic gains in supply chains are not
always equally distributed among the players in the supply chain [31].
Concerning economic feasibility, a distinction should be made between personnel
expenses and expenses for resources, and between one-time and recurring
expenditures. The use of time and resources related to a specific
process is identified and evaluated. The identified effects are quantified; expenditures
and potential benefits must be documented accordingly. In particular, expenditures
related to the use of the technology that might cause additional costs must be documented and shown. The result is a cost model focusing on capital value and payback
time, which is used to determine profitability under uncertainty. This makes it possible to
assess the overall effect of use across organizational borders. In addition, an expansion
analysis can be carried out, which analyzes the essential factors for the use of
IoT-applications such as RFID [20].
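As a sketch of such a cost model, the following computes the capital (net present) value and a simple undiscounted payback time from a one-time investment and recurring net benefits. The function names and figures are illustrative assumptions, not taken from [20]:

```python
def capital_value(one_time_cost, annual_net_benefit, years, discount_rate):
    """Net present value of the investment: the one-time expenditure
    against the discounted recurring net benefits (annual benefits minus
    recurring personnel and resource expenses)."""
    npv = -one_time_cost
    for t in range(1, years + 1):
        npv += annual_net_benefit / (1 + discount_rate) ** t
    return npv

def payback_years(one_time_cost, annual_net_benefit):
    """Undiscounted payback time in whole years; None if never recovered."""
    if annual_net_benefit <= 0:
        return None
    cumulative, t = 0.0, 0
    while cumulative < one_time_cost:
        t += 1
        cumulative += annual_net_benefit
    return t
```

Uncertainty can then be handled by re-evaluating both figures for pessimistic, expected, and optimistic benefit estimates.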
A. Mladenow et al.
Atzori et al. [10] distinguish between “semantic”-, “internet”-, and “things”-orientated IoT visions. In this regard, partners of cross-organizational collaborations
already benefit from new promising developments in these areas [32]. On that score,
the shared information hub (SIH) and the total supply chain visibility (TSCV) support
IoT-applications for supply chains.
SIH is a tool aimed at simplifying and improving data exchange, providing a
solution for typical problems of supply chain management [33]. Due to a lack of
data exchange, the individual players in a supply chain usually base their business
planning on different assumptions, which results in a gradual, downstream domino
effect (“informational domino principle”). By sharing data, the so-called bullwhip
effect can be mitigated or even prevented [34]. Using the SIH allows for cross-organizational
and integrated planning in the context of supply chain management, based on a shared
data pool. This approach is made possible through the use of IoT-applications, which
provide real-time and automated information to all parties involved along the entire
supply chain. In this way potential delivery delays, problems or even failures can be
detected early on (e.g., the failure of a container’s cooling unit in the food industry),
long before it would have been possible with a conventional communication strategy.
This allows supply chain members to react early and to find appropriate alternative
solutions in time.
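The bullwhip mitigation argument can be illustrated with a toy two-stage simulation, assuming an order-up-to policy with a moving-average forecast (the policy, parameters, and function names are assumptions for illustration; this does not model the actual SIH). Without a shared data pool the wholesaler plans on the retailer's already distorted orders; with it, the wholesaler plans on real end-customer demand:

```python
import random
import statistics

def order_stream(signal, lead_time=2, window=5):
    """Order-up-to policy with a moving-average forecast:
    q_t = d_{t-1} + (lead_time + 1) * (forecast_t - forecast_{t-1})."""
    orders = []
    for t in range(window + 1, len(signal)):
        f_now = sum(signal[t - window:t]) / window
        f_prev = sum(signal[t - window - 1:t - 1]) / window
        orders.append(max(0.0, signal[t - 1] + (lead_time + 1) * (f_now - f_prev)))
    return orders

def bullwhip_ratio(share_demand, periods=2000, seed=7):
    """Variance of the wholesaler's orders relative to end-customer demand."""
    rng = random.Random(seed)
    demand = [rng.gauss(100, 10) for _ in range(periods)]
    retailer_orders = order_stream(demand)
    # With a shared information hub the wholesaler plans on real demand;
    # without it, on the retailer's (already amplified) orders.
    wholesaler_orders = order_stream(demand if share_demand else retailer_orders)
    return statistics.variance(wholesaler_orders) / statistics.variance(demand)
```

In this sketch the variance ratio grows at every stage that plans on distorted orders, while demand sharing caps the amplification at a single stage.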
The TSCV approach is an ICT solution that allows members of a complex value
network to improve, intensify, and automate their communication with each other.
This makes it possible to restructure, improve and cut out previously inefficient process
steps, which results in a more efficient use of available resources and higher performance. In addition, the improved coordination and communication between the business partners leads to a closer degree of collaboration and integration along the supply
chain; this may positively affect the level of trust. Especially with regard to cooperation
and joint projects, these network effects and ties play an important role for
cross-organizational collaborations.
5 Conclusion and Outlook
In this paper we analyzed the integration of a “things”-orientated IoT-application [10]
for an Austrian cross-company collaboration. The retailer perceives the IoT-application
on the supplier side as a competitive advantage compared to other suppliers. For the
retailer, the use of the IoT-technology implies both cost savings and a greater performance level. For these advantages, retailers are willing to pay a higher price to suppliers, or even decide to cooperate exclusively with suppliers that utilize such IoT-applications.
At present, the integration of cross-organizational IoT-applications is still in its
infancy, as an enormous number of novel applications emerge on the market. Consequently, numerous concepts are implemented, which lead to a reorganization and
new division of tasks and competencies of both manufacturers and their customers. In
terms of cross-organizational collaboration, increased benefits can be achieved only if
IoT-applications are being used across organizations, and only if the costs can be
allocated to the individual partners of the supply chain. In this regard, a survey with
multiple retailers shall be conducted to obtain quantified results that can be used as a
basis for future research. Hence, scenarios should be examined appropriately to keep
the risk of an IoT-implementation as low as possible.
References
1. Columbus, L.: In Forbes. http://www.forbes.com/sites/louiscolumbus/2015/12/27/roundupof-internet-of-things-forecasts-and-market-estimates-2015/#e7af8c848a05. Accessed 25 Aug
2. Mladenow, A., Novak, N.M., Strauss, C.: Mobility for ‘Immovables’–clouds supporting the
business with real estates. Procedia Comput. Sci. 63, 120–127 (2015)
3. Hyben, B., Mladenow, A., Novak, N.M., Strauss, C.: Consumer acceptance on mobile
shopping of textile goods in Austria: modelling an empirical study. In: Proceedings of the
13th International Conference on Advances in Mobile Computing and Multimedia, pp. 402–
406. ACM, December 2015
4. Mladenow, A., Bauer, C., Strauss, C., Gregus, M.: Collaboration and locality in
crowdsourcing. In: 2015 International Conference on IEEE Intelligent Networking and
Collaborative Systems (INCOS), pp. 1–6, September 2015
5. Cheng, Y., Tao, F., Xu, L., Zhao, D.: Advanced manufacturing systems: supply–demand
matching of manufacturing resource based on complex networks and Internet of Things.
Enterp. Inf. Syst. 1–18 (2016)
6. Mladenow, A., Bauer, C., Strauss, C.: Crowdsourcing in logistics: concepts and applications
using the social crowd. In: Proceedings of the 17th International Conference on Information
Integration and Web-Based Applications and Services, p. 30. ACM, December 2015
7. Mladenow, A., Bauer, C., Strauss, C.: ‘Crowd Logistics’: the contribution of social crowds
in logistics activities. Int. J. Web Inf. Syst. 12(3), 379–396 (2016)
8. Makarova, T., Mladenow, A., Strauss, C.: Barrierefreiheit im Internet und Suchmaschinenranking – eine empirische Untersuchung. Informatik 2016. Nr. 259, Lecture Notes in
Informatics, Gesellschaft für Informatik, Köllen Druck. Bonn, pp. 1071–1086 (2016)
9. Speranza, M.G.: Trends in transportation and logistics. Eur. J. Oper. Res. http://tinyurl.com/
10. Atzori, L., Iera, A., Morabito, G.: The internet of things: a survey. Comput. Netw. 54(15),
2787–2805 (2010)
11. Whitmore, A., Agarwal, A., Da Xu, L.: The internet of things—a survey of topics and trends.
Inf. Syst. Front. 17(2), 261–274 (2015)
12. Guo, B., Zhang, D., Wang, Z., Yu, Z., Zhou, X.: Opportunistic IoT: exploring the
harmonious interaction between human and the internet of things. J. Netw. Comput. Appl.
36(6), 1531–1539 (2013)
13. Mladenow, A., Bauer, C., Strauss, C.: Collaborative shopping with the crowd. In: Luo, Y.
(ed.) CDVE 2015. LNCS, vol. 9320, pp. 162–169. Springer, Heidelberg (2015). doi:10.
14. Werner, H.: Supply Chain Management: Grundlagen, Strategien, Instrumente und Controlling. Springer, Heidelberg (2013)
15. Beamon, B.M.: Measuring supply chain performance. Int. J. Oper. Prod. Manag. 19(3), 275–
292 (1999)
16. Bensel, P., Gunther, O., Tribowski, C., Vogeler, S.: Cost-benefit sharing in cross-company
RFID applications: a case study approach. In: ICIS 2008 Proceedings, p. 129 (2008)
17. Siggelkow, N.: Persuasion with case studies. Acad. Manag. J. 50(1), 20 (2007)
18. Attaran, M.: Critical success factors and challenges of implementing RFID in supply chain
management. J. Supply Chain Oper. Manag. 10(1), 144–167 (2012)
19. Novak, N.M., Mladenow, A., Strauss, C.: Avatar-based innovation processes-are virtual
worlds a breeding ground for innovations?. In: Proceedings of International Conference on
Information Integration and Web-based Applications and Services, p. 174. ACM, December
20. Irrenhauser, T., Reinhart, G.: Evaluation of the economic feasibility of RFID in the supply
chain. Prod. Eng. 8(4), 521–533 (2014)
21. Widhalm, M., Mladenow, A., Strauss, C.: E-Appointment Plattformen zur Effizienzsteigerung und Umsatzgenerierung–eine Branchenanalyse. HMD Praxis der Wirtschaftsinformatik 52(3), 401–417 (2015)
22. Mladenow, A., Novak, N.M., Strauss, C.: Online ad-fraud in search
engine advertising campaigns. In: Khalil, I., Neuhold, E., Tjoa, A.M., Da Xu, L.,
You, I. (eds.) CONFENIS/ICT-EurAsia 2015. LNCS, vol. 9357, pp. 109–118. Springer,
Heidelberg (2015). doi:10.1007/978-3-319-24315-3_11
23. Richards, G., Grinsted, S.: The Logistics and Supply Chain Toolkit: Over 100 Tools and
Guides for Supply Chain, Transport, Warehousing and Inventory Management. Kogan Page
Publishers, London (2016)
24. EPC/RFID-Calculator. http://tinyurl.com/znl4ryt. Accessed 01 Aug 2016
25. Li, S., Da Xu, L., Zhao, S.: The internet of things: a survey. Inf. Syst. Front. 17(2), 243–259 (2015)
26. Christopher, M.: Logistics and Supply Chain Management. Pearson Higher Ed., New York
27. Welbourne, E., Battle, L., Cole, G., Gould, K., Rector, K., Raymer, S., Borriello, G.:
Building the internet of things using RFID: the RFID ecosystem experience. IEEE Internet
Comput. 13(3), 48–55 (2009)
28. Uckelmann, D., Harrison, M., Michahelles, F.: An architectural approach towards the future
internet of things. In: Uckelmann, D., Harrison, M., Michahelles, F. (eds.) Architecting the
Internet of Things, pp. 1–24. Springer, Heidelberg (2011)
29. Bi, Z., Da Xu, L., Wang, C.: Internet of things for enterprise systems of modern
manufacturing. IEEE Trans. Ind. Inform. 10(2), 1537–1546 (2014)
30. Kärkkäinen, M.: Increasing efficiency in the supply chain for short shelf life goods using
RFID tagging. Int. J. Retail Distrib. Manag. 31(10), 529–536 (2003)
31. Gelsomino, L.M., Mangiaracina, R., Perego, A., Tumino, A.: Supply chain finance: a
literature review. Int. J. Phys. Distrib. Logist. Manag. 46(4), 348–366 (2016)
32. Leminen, S., Westerlund, M., Rajahonka, M., Siuruainen, R.: Towards IoT
ecosystems and business models. In: Andreev, S., Balandin, S., Koucheryavy, Y.
(eds.) NEW2AN/ruSMART 2012. LNCS, vol. 7469, pp. 15–26. Springer,
Heidelberg (2012). doi:10.1007/978-3-642-32686-8_2
33. Usländer, T., Berre, A.J., Granell, C., Havlik, D., Lorenzo, J., Sabeur, Z., Modafferi, S.: The
future internet enablement of the environment information space. In: Hřebíček, J., Schimak,
G., Kubásek, M., Rizzoli, A.E. (eds.) ISESS 2013. LNCS, vol. 413, pp. 109–120. Springer,
Heidelberg (2013). doi:10.1007/978-3-642-41151-9_11
34. Rogers, H., El Hakam, T.A., Hartmann, E., Gebhard, M.: RFID in retail supply chains:
current developments and future potential. In: Dethloff, J., Haasis, H.-D., Kopfer, H.,
Kotzab, H., Schönberger, J. (eds.) Logistics Management. LNL, pp. 201–212. Springer,
Heidelberg (2015). doi:10.1007/978-3-319-13177-1_16
Application of the papiNet-Standard
for the Logistics of Straw Biomass in Energy
Jussi Nikander(B)
Natural Resources Institute Finland, Vihti, Finland
[email protected]
Abstract. Multi-fuel solutions are an increasingly common set-up in
CHP (Combined Heat and Power) plants. Many also use different types
of biofuels, such as wood or agricultural products. In Finland, the most
prominent type of biofuel in CHP is forestry products, with agricultural
biofuel playing only a marginal part. This work investigates the use of
the papiNet standard, originally designed for the forestry supply chain,
as a possible data exchange format for a multi-fuel supply chain where
forestry products are the dominant type of fuel. In the work a model
for the data exchange between different actors in the supply chain is
described, and the application of papiNet in it is explored. As a result,
the papiNet standard is found to be suitable for use with some provisions.
Keywords: Multi-fuel supply chain · Supply chain
logistics · Bioenergy ecosystem · Logistics chain
One current trend in energy production is the use of biomass in CHP (Combined Heat and Power) power plants. Such systems sometimes use one specific
type of biomass, such as wood, but often they are designed as multi-fuel
systems. Solutions in which biomass is co-fired together with fossil fuels such as coal are also common [18]. The types of biomass used in energy production can
be characterized in several ways. McKendry divides the material by type into
woody plants, grasses, aquatic plants, and manure, with grasses further subdivided according to moisture content [10]. Another way to categorize biomass
is to divide it into crops and wastes, with waste coming from three sources:
forests, agriculture, and municipal waste [12]. The latter categorization is more
suitable for the purposes of this paper, as many biomasses used in CHP energy
production are created as by-products or waste products of some other process,
such as felling waste or sawdust. There are, however, agricultural biomasses that
are grown for use in CHP energy production, such as reed canary grass [2,7].
However, despite research and development efforts, cultivation of such energy
crops is currently not a particularly wide-spread phenomenon [6]. In 2007 the
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 177–190, 2016.
DOI: 10.1007/978-3-319-49944-4 14
J. Nikander
total estimated area for energy crop cultivation in the whole EU was approximately 2.5 million [1,4] to 5.5 million [16] hectares out of approximately 109
million hectares. The majority of this area was used to grow crop for biofuel
production, such as rape [4,16]. Thus, in order to use agricultural biomass in
any significant degree in CHP power plants, by-products, such as straw, need to
be utilized.
Forest biomass, as well as solid agricultural biomass, such as straw or reed
canary grass, can be directly used in common CHP plants. In case of co-firing
plants, different types of fuel are typically mixed together. Such a
multi-fuel power plant requires the management of multiple different fuel sources,
and thus has a need for sophisticated information systems to manage the different supply chains and ensure the availability of fuel. In Finland, the most
common type of biomass in energy production is forest biomass. However, there
is significant interest in exploring the potential of other sources of biomass in
energy production, and how these sources could be easily integrated to the current supply chain, which is focused on forest products.
This paper concentrates on the information management of the supply chain
logistics of straw biomass in a multi-fuel environment in Finland. The main focus
is on investigating suitable means for the different actors involved in the supply
chain to exchange data in a machine-readable and standardized manner.
In Finland, the forestry supply chain, which includes the supply of forest biomass
for energy production, uses the papiNet standard [17] for data exchange. The
goal of this work was thus to investigate whether papiNet could also be used
to handle the data exchange needs of the straw biomass supply chain, or if
another data exchange format is needed. The research question of this work is:
“How feasible is it to use forestry data exchange standards to handle straw supply
chain information management for multi-fuel biomass power plants?”
2 Materials and Methods
The forestry supply chain is an area of bioeconomy where the integration of
software systems and automatic data exchange between different actors is well-developed. In the Nordic countries, one reason for this level of development is that
the supply chain is dominated by a relatively small number of large forest industry companies¹. Their central market position puts these companies in a position
to make industry standards for the supply chain. Forestry logistics has at least
two important, internationally supported standards: StanForD and papiNet. The
StanForD (Standard for Forest machine Data and Communication) standard,
maintained by the Forestry Research Institute of Sweden, is meant for data storage
and exchange between computers in forestry machines [20], whereas the papiNet standard, maintained by the papiNet initiative, is meant for data exchange
between actors in the forest and paper products supply chain [17]. In addition
to these two, there are other standards used in various nations for a number
of different purposes, such as buying and selling timber, or maintaining forest
¹ Metsä Group, UPM, and Stora Enso in Finland.
papiNet-Standard for Straw Biomass Logistics
resource maps. As an example, the various national forestry standards used in
Finland can be found on a single web page².
The current versions of the StanForD and papiNet standards have been
implemented using XML, which is currently a very commonly used method in
business to business interoperability [5,14,15]. The papiNet standard attempts
to cover the supply chain of forestry products as well as possible, and has been
adopted for use in several countries [3,11,13]. The standard covers data transfer needs for different parts of the supply chain from business activities, such
as requests for availability or complaints, to measuring, storing and moving the
products. The standard also explicitly covers eight product categories ranging
from forest wood to book manufacturing and pulp [17].
Currently, the use of straw in energy production in Finland is marginal [6]
despite active research on the topic [19]. However, there is significant potential in
energy production with straw, and in Nordic countries there are existing power
plants using straw, such as the Fyn Power Station in Denmark [19]. Fyn produces
24.5 MW of electricity and 84 MW of heat, consuming 170 000 tons of
straw annually as fuel. The total annual straw harvest in Denmark is 5.5 million tons [1].
Currently in Finland, the most ambitious straw power project is the bio-ethanol
production plant under construction at Kouvola, which would consume
330 000 tons of straw to produce 72 000 tons of ethanol³.
Typically, in the straw supply chain it is assumed that the straw is baled upon
harvest and that the supply chain will handle bales. There have been attempts at
transporting unbaled straw in Finland, but experiments showed it to be a worse
option than using bales. Typical Finnish balers produce round bales. Rectangular
balers, which would produce bales more suitable for the supply chain due to their
shape, are rare due to price, weight, and low demand. However, with rectangular
bales there would be less wasted space during transport.
This work was conducted as part of the Finnish strategic research project
BEST - Sustainable Bioenergy Solutions for Tomorrow⁴. Part of the BEST
project was the development of a biomass virtual terminal, an integrated software solution for managing biomass lots stored in various locations ranging
from large, permanent storage stacks to small, transient roadside piles. The initial goal of the virtual terminal was to be able to meet the future requirements
of forestry biomass supply chain in Finland. However, there was a strong need to
expand the terminal to also include other types of biomass. Straw was selected
as the first case for the expansion of the concept.
As part of the research on the virtual terminal concept, the flows of data
between different forest supply chain actors in the Finnish biomass logistics chain
were modeled. The constructed model was then reviewed and refined in workshops. This forest biomass model was used as the basis for the inclusion of straw in
the virtual terminal. In addition, previous work on the use of straw in energy production was mapped using a literature review and expert interviews. This work
acted as the basis for the development of a data model for the straw supply chain.
Based on the information gained from the forest biomass data model and the
expert interviews, the forestry model was modified for handling straw biomass.
The straw model was then reviewed and refined using expert opinions, and the
information that needed to be transferred between different actors, systems and
processes was extracted from the model. The result of the extraction was then
used to further review and develop the model. Finally, examples of the data
messages included in the model were constructed using the papiNet standard.
The papiNet examples constructed were compared against the model and the
information exchange needs in order to analyze the suitability of the papiNet
standard for the task. In the analysis, special emphasis was placed on the coverage of the standard, namely whether it can be used to depict all the information required,
and on the suitability of the standard, namely whether all the information required can be
expressed in a concise and useful manner.
3 Results
The main result of this work is the model of the communication between different actors and processes in the straw supply chain for energy production, and
the analysis of this model. Based on the model and the defined data transfer
needs, examples of communication messages using the papiNet standard have
been developed. The model has been analyzed using expert reviewers, and its
suitability for purpose has been assessed. The model is depicted in Figs. 1 and 2.
The notation in the figures is inherited from the related forestry model, and it
has been implemented using the Altova UModel diagram tool.
The black rectangles in the Figures depict the actors and systems involved
in the model. There are three primary actors: the energy company, the supply
chain, and the farmer. The energy company’s ERP system and farmer’s Farm
Management Information System are also depicted separately in the Figures.
The goal of the energy company is to procure fuel for their power stations in
order to ensure that the stations can always produce the desired amount of
energy, both electricity and heat. The supply chain has been hired by the energy
company to transport straw from the farmers to the power stations. The farmers
have been contracted by the energy company to cultivate cereals and sell their
straw for energy. In a typical situation there is one energy company and numerous
farmers involved in the process. The number of supply chain actors is at least one,
but can be more. Different actors can, for example, be responsible for different
geographical areas.
The different systems and processes in Figs. 1 and 2 are depicted using two
elements that represent different processes in the model. The yellow rectangles
are processes that communicate with other processes in the model, and blue
rectangles are data management systems or processes that do not directly communicate with other parts of the model. There is one such detached process
in Fig. 2: the process of cultivating a farm. The numbered steps in the Figures
depict various stages of the straw supply management. The arrows in the model
depict communication from one process to another, where each arrow depicts
one exchange of data. Finally, both the farmer and the energy company have
their own information systems that interact with their processes.
Making a Farming Contract
The making of the farming contract and the harvest notification are shown in
detail in Fig. 1. The process starts with the farmer making a preliminary agreement with the energy company for selling them straw. Based on this agreement,
the energy company can create in their ERP farmer database an entry for the
farmer, which they then use to store all information about the farmer that they
need. Similarly, the farmer can store the details of the agreement to the software system they use to store the farm data. The preliminary agreement acts
as a basis for the actual farming contract between the energy company and the
farmer. In the farming contract, the farmer and the energy company agree on
the approximate amount of straw the farmer is willing to sell. At harvest the
farmer will find out how much straw they actually have, and can inform the
energy company about how much straw they are capable of selling.
Fig. 1. The straw supply chain model from making of the farming contract until the
harvest notification
The process of making a farming contract will most likely be done primarily
using means that are not directly machine readable, i.e. paper contracts. While
electronic contracts are possible, the paper contract tradition is still extremely
strong and therefore likely to persist for quite a while. From the point of view
of information management, the most important details for the farmer and
energy company to agree on during the contract-making process are the approximate
amount and location of the straw the farmer is willing to sell. Using this information, the energy company can make preliminary plans for arranging their supply
chain to provide them with straw until the next harvest period.
Typically, the machine-readable data transfer in the model starts with the
farmer sending a harvest report to the energy company (step 4). Previous communication is likely to be based primarily on human-readable documents. The
harvest report covers the amounts and locations, and possibly also the moisture
percentage, of the harvest. The assumption is that the straw will be baled for transport, and thus the basic unit of measurement that needs to be transmitted is the
number of bales available. If the baler can also measure the weight and moisture
of a bale, further information can also be transmitted. Location is assumed to
be either a street address, or GPS coordinates.
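As an illustration, a harvest report (step 4) could be serialized as an XML message along the following lines. The element names are invented for this sketch and are not the actual papiNet vocabulary, except for VATIdentificationNumber, which the standard provides as a party identifier:

```python
import xml.etree.ElementTree as ET

def harvest_report(farmer_vat, bales, avg_weight_kg=None, moisture_pct=None,
                   address=None, gps=None):
    """Build a harvest-report message carrying the bale count, optional
    quality data, and the storage location (street address or GPS)."""
    root = ET.Element("HarvestReport")
    party = ET.SubElement(root, "SenderParty")
    ET.SubElement(party, "VATIdentificationNumber").text = farmer_vat
    lot = ET.SubElement(root, "StrawLot")
    ET.SubElement(lot, "BaleCount").text = str(bales)
    if avg_weight_kg is not None:
        ET.SubElement(lot, "AverageBaleWeight", UOM="KGM").text = str(avg_weight_kg)
    if moisture_pct is not None:
        ET.SubElement(lot, "MoisturePercentage").text = str(moisture_pct)
    loc = ET.SubElement(lot, "Location")
    if address:
        ET.SubElement(loc, "Address").text = address
    if gps:
        ET.SubElement(loc, "GPSCoordinates").text = f"{gps[0]},{gps[1]}"
    return ET.tostring(root, encoding="unicode")
```

Weight and moisture are optional here precisely because not every baler can measure them, mirroring the assumption above.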
Order for New Fuel
The more detailed description of the model continues in Fig. 2. The left side of the
Figure covers the model from the creation of a new order for fuel to the sending
of a notification of straw retrieval to the farmer, while the right side covers the
model from straw pickup to the fulfillment of the fuel order. The communication
depicted by steps 6 through 8 within Fig. 2 happens inside the energy company’s
organization. The activity begins when the warehouse management determines
a need for more fuel (step 5) and sends an order for fuel (step 6) that starts a new
fuel procurement process. The most important things given to the new process
are the amount of fuel, the delivery location, and deadline for delivery.
The fuel procurement process forwards the order for fuel to the energy company’s farmer database to select the loads of straw that would fulfill the order
(step 7). If needed, the loss of straw between the harvest and the current time
can be estimated using loss modeling, if such a model is available; currently, no loss
models are in use. The farmer database then sends the fuel procurement process
information on how to fulfill the order (step 8). This message contains information
regarding straw lots to be retrieved, including the location and owner of each
lot, as well as information on the amount of straw.
After the fuel procurement process has gained information on what to deliver
to the power station, it can forward the information to the supply chain, which
will create a new straw delivery process to handle the pickup, transport, and
delivery of the straw. The fuel order that the fuel procurement process transmits to the straw delivery process (step 9) in Fig. 2 is a combination of the
fuel order (step 6) and the straw data (step 8). The supply chain then accepts
the order (step 10). Should the order be rejected, the fuel procurement process
needs to react. Possible actions include contacting another supply chain operator, adjusting the order, or advising the power station that new fuel order cannot
fulfilled given the criteria.
After the straw delivery process has accepted the order, the next action is
sending the farmer an advance notification of straw retrieval (step 11). This
message informs the farmer about the location and the amount of straw to be
taken, as well as the estimated time when it will happen.
Fig. 2. The straw supply chain model from the creation of an order for new fuel, until
the delivery of straw to the power station
Fuel Delivery
The right side of Fig. 2 covers the model from straw pickup to the fulfillment
of the fuel order. In the model, as part of the straw pickup, the supply chain
actor makes a quality measurement, in which the weight and moisture of each bale
of straw are measured. If a quality measurement cannot be made, the model
skips to the retrieval of straw.
The quality measurement process starts with the straw retrieval process
making a quality measurement order (step 12). The important information transmitted in the order is the location, the time, and the type of measurement to be
done. If needed, the order can be accepted in a manner similar to step 10 (the
acceptance procedure is not included in the model). When the measurement has
been done, the quality measurement results can be sent to the straw retrieval
process, the fuel procurement process, and the farmer (step 13). The measurement results should contain the weight and moisture of each straw bale included
in the delivery.
When a lot of straw is picked up from a specific farmer, the straw delivery
process sends the farmer a message containing information about the location, the time, and the amount of straw that has been picked up (step 14). At
the same time, the straw delivery process also sends this information to the fuel
procurement process (step 15) as well as to the power station (step 16).
When a load of straw arrives at the power station, the supply chain actor
will hand it over to the straw reception process. Upon receiving the load, the
reception process will send a message to the warehouse (step 17) and to the fuel
procurement process (step 18), informing them on the arrival of the new batch
of fuel. The message to the warehouse includes the amount of new fuel delivered,
while the message to the fuel procurement process informs this process that a
specific shipment has arrived. The fuel procurement process updates the farmer
database (step 19). If the arrival of straw concludes a specific fuel procurement,
the fuel procurement process, and with it the order for fuel, end. Otherwise the
fuel procurement process will continue to wait for confirmations regarding the
rest of the fuel order.
Based on the model described in Sect. 3 and the example papiNet messages
created in this work, the usefulness of the standard for the problem of the straw
supply chain was analyzed using a standard SWOT analysis. The results of the
analysis can be found in Table 1. The table separately lists all four
aspects of the SWOT analysis in order to give the reader a quick overview of the
advantages and disadvantages of the papiNet standard. More detailed discussion
of each element can be found in Sects. 4.1–4.3.
Table 1. Strengths, weaknesses, opportunities, and threats on the use of papiNet for
straw logistics in bionenergy context
Strengths:
- Contains all required message types
- Capable of expressing the required information
- Sufficient ability to cross-reference
- Mature, robust technology
- Significant support in forest industry

Weaknesses:
- Relative identifiers for many elements
- No support for describing agricultural contracts
- Weak support for identifying non-forestry actors and concepts

Opportunities:
- Supported in several countries
- Existing platforms make it easier to expand the scope of the solution

Threats:
- May be unsuitable for environments where forestry products are not dominant
- Poor profitability of the sector
- In Finland, the lack of appropriate machinery
The papiNet standard includes message types that can be used to deliver all the
information required in the fuel delivery process. Table 2 shows
which papiNet message type is used in each step of the process. As can be
seen in the table, six different types of papiNet messages have been employed.
However, some steps could also be handled by other messages, such as using a
DeliveryMessage in step 18. Thus, this is not the only means to implement the
papiNet-Standard for Straw Biomass Logistics
model using papiNet. In an actual implementation of the virtual terminal concept,
the message types to be used would then need to be decided.
An important part of the data exchange is the identification of each actor in
the process, as well as cross-referencing all relevant messages in the process. In
papiNet all actors are called parties, and each of them needs to have a unique
identifier. The standard includes a large number of different types of identifiers,
such as the global papiNetGlobalPartyIdentifier and VATIdentificationNumber.
The global identifier is an IANA Private Enterprise Number (PEN)5 , whereas
VATIdentificationNumber is the VAT number of the actor. Both are well-suited
to act as unique identifiers in the process.
Cross-referencing messages is required, for example, because a single order
for fuel may be divided into a large number of deliveries carried out by
different supply chain actors. Thus there must be a means to link each delivery
instruction to a specific order. papiNet messages allow references to other
messages, covering all the different message types in the model, and thus
the model offers sufficient support for maintaining cross-referencing. However, in
some cases references require an authority to be assigned to them. Many possible
authorities are relative to the message, which can make it difficult to pass
references along.
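To make the discussion concrete, the fragment below builds a simplified papiNet-style message with Python's standard ElementTree. The element and attribute names imitate the papiNet naming style but are abbreviated and not schema-valid; treat the structure, the sample VAT number, and the order number as illustrative only:

```python
import xml.etree.ElementTree as ET

# Simplified, schema-free sketch of a papiNet-style delivery instruction.
msg = ET.Element("DeliveryInstruction")

# Each actor ("party") carries a unique identifier; here a VAT number.
party = ET.SubElement(msg, "SenderParty")
ident = ET.SubElement(party, "PartyIdentifier",
                      {"PartyIdentifierType": "VATIdentificationNumber"})
ident.text = "FI12345678"

# Cross-reference back to the fuel order this delivery belongs to.
ref = ET.SubElement(msg, "DocumentReferenceInformation",
                    {"DocumentReferenceType": "PurchaseOrderNumber"})
ref.text = "ORDER-2016-001"

print(ET.tostring(msg, encoding="unicode"))
```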
As can be seen, the papiNet standard includes data elements that can be
used to depict the required information for the supply chain to work. Thus there
are rather low barriers for including straw supply chains in the same information
system as forestry product supply chains and with papiNet it is possible to create
a multi-fuel supply chain information management system.
Table 2. The papiNet message types used in each step of the model. Natural language
stands for unstructured text documents in natural language, and N/A stands for steps
that are not associated with data exchange
[Table body not fully recoverable from the source layout; the message types
employed include Natural language, DeliveryInstruction, BusinessAcceptance,
and MeasuringTicket.]
A possible problem with using the standard is the use of relative identifiers in
many messages. For example, many elements in the standard require defining
the actor responsible for assigning the value, such as a message reference. Such
identifiers are needed, for example, when papiNet is used to send messages between
different actors in the same organization. In such a case it may not be possible to use
the PEN or VAT identifiers, as these can be identical for both the sender and
the receiver. In such messages, it is possible to use, for example, the papiNet
AssignedBySender identifier, which allows for the use of arbitrary identifiers.
However, AssignedBySender is a relative ID, and thus cannot easily be passed
along a message chain. The same problem with relative identifiers is also found
elsewhere in the standard.
Another problem with the use of papiNet is that the standard does not have
any built-in means of describing agricultural concepts or products. The standard
has a product description element called <product>, under which different types
of products can be described in detail. However, the <product> element supports
only seven categories of forestry products. Everything else must be described using
the other products category, which is very generic and thus cumbersome to use.
The standard was originally designed for forestry products, and thus the handling
of non-forest biomass can, in general, be cumbersome. The values of many
attributes in the standard are forestry-related, such as the itemType attribute
in the <itemInfo> element, which describes the type of item being measured.
Many of the possible attribute values are directly related to forestry, such as
Log. <itemInfo> has some general values, such as box or pallet. However, as
these values are very generic, their usefulness is somewhat limited. Some of the
values can actually be deceptive in the context of agriculture. One possible value
of <itemInfo> is baleItem, but in the standard this is assumed to be a bale of
pulp or paper. Thus using this value for straw bales is, at least in principle,
a misuse of the standard and can lead to confusion in an implementation.
Therefore, while the papiNet standard is a good choice for supply chain data
exchange in a business ecosystem where forestry products have a significant
presence, other solutions should be considered in cases where forestry products
are either not present at all, or have only a minor role.
Opportunities and Threats
papiNet has significant international support in the forestry industry.
The papiNet initiative has over 40 member companies from both Europe and
North America, giving the standard widespread and stable industry support.
This, in turn, translates to significant geographical coverage in the use of the
standard. Therefore papiNet is a safe decision in the sense that there is a strong
organization behind the standard that guarantees its future use.
Furthermore, the widespread support for the standard also translates into
platforms that support the use of papiNet. Thus there are both existing platforms
in production use where papiNet is supported, as well as people who have
experience in working with the standard and implementing it in actual production
environments. It is therefore easier to take the papiNet standard into use than to
apply technology that is not as robust or widely used.
However, the dilemmas of which data exchange standard to use, or how to
apply it, are relatively minor compared to the problem of making the straw
ecosystem in bioenergy production attractive to the various actors. Straw supply
chains have been successfully created, as shown by the Fyn Power Station in
Denmark, and, in principle, there is interest among Finnish farmers to sell straw
for energy [19]. However, the income a farmer can get from straw is quite small:
the typical price a farmer might get is under €20 per ton, a fraction of the
current low Finnish grain price of approximately €140 per ton [21]. In practice,
this makes many farmers think the income is not worth the effort. Unfortunately
the straw price paid to the producer cannot be increased, or straw would become
an uneconomical source of energy compared to other fuels [19].
Furthermore, an efficient supply chain would require the weight and moisture
of the straw to be measured during baling, in order to give the energy company
information about the quality of the harvest. However, current balers in Finland
typically cannot do this and the low profit potential of selling straw does not
make new investments attractive.
Bale weight is relatively simple to measure even after baling, as this can be
done with any tractor that has a scale attached to the front loader. Measuring
the moisture of the bale after baling can be more difficult, however, as it is
recommended that bales be wrapped in plastic for storage [8]. Penetrating the
wrap to measure the moisture would adversely affect the storage characteristics
of the bale. There is a similar problem if the moisture content is measured during
bale retrieval, if the plan is to store the bales at the power station premises for
an extended period of time.
A relatively significant problem in the supply chain is also the fact that
most balers in Finland produce round bales. The round shape is problematic in
logistics, as it leads to a relatively large amount of wasted space between bales.
For efficient logistics, large rectangular bales would be better than round ones.
However, rectangular balers are rare in Finland, and their large weight can be a
problem on Finnish fields.
Thus, the problems in setting up the required information systems are relatively
minor compared to the other practical problems in using straw for energy production.
The Biomass Virtual Terminal
The communication model described in this paper is intended to be used as
part of a biomass virtual terminal, an integrated software solution for managing
lots of biomass stored in various locations. The goal of the terminal is to make
it possible to have a single system that can be used to manage all biomass an
energy company has intended to use in its power plants. By its very nature,
such a system requires a large number of independent actors to cooperate and
share data. From the point of view of the energy company, the situation is a
relatively traditional cooperative venture with a large number of subcontractors.
The number of subcontractors required can be large, but there are only a few
different types of actors involved in the process.
The straw logistics process described in this paper contains two types of
subcontractors: the farmers and the supply chain actors. The virtual terminal
will also require two other types of subcontractors in order to handle forest
biomass: the forest harvester contractors and the forestry companies. The forestry
companies manage and handle the primary forest biomass trade, and sell felling
waste and other such products to the energy company. The contractors do the
actual forest harvesting. The subcontractors are typically managed either by
the energy company directly, or by a specific party the energy company has
outsourced their fuel acquisition to. In the straw model the assumption is that
the energy company directly manages the subcontractors, as there currently are
no actors who would provide such a service in Finland. However, should the straw
biomass market grow, such actors would most likely appear. Alternatively,
companies that currently manage forest biomass acquisition could expand their
business to also cover straw.
The whole production and supply chain for a co-fired CHP plant can, in
reality, grow large and complex. Thus it might be useful to approach the problem
as a whole using a virtual enterprise model or similar [9]. So far, such an overview
of the whole process chain has not been made, but it could be done in the future.
Similarly, the different types of fuel included in such supply chain can
be extremely heterogeneous. Thus, if the biomass virtual terminal were
extended into a complete fuel virtual terminal encompassing all fuel used by
an energy company, the papiNet standard in its current form would quickly
become inadequate. The element types, possible values, and especially the whole
<product> element would need to be expanded further and generalized.
From an information technology point of view it is relatively simple to adapt the
papiNet standard for use in agricultural bioenergy chains. The usability of
papiNet for straw has been demonstrated in this paper, and similar means can
be used for other agricultural biomasses. This is an advantage in an environment
similar to the current bioenergy business ecosystem in Finland, where the
business is concentrated mainly on forestry biomass, and existing logistics
systems already use forestry-related standards such as papiNet. In an environment
where this external motivation for the use of forestry standards does not exist,
a new assessment of the existing standards should be done, and some other
standard may well be better suited for the task.
Similarly, should the number of different types of fuel handled grow very
large, the limitations of the papiNet standard in its current form will soon be
encountered. The standard is designed primarily to handle logistics and business
transactions in forestry, not in the energy industry. Therefore other standards
may be better suited in such a case, or alternatively the papiNet standard would
require a significant amount of further work.
However, the real barrier to the use of agricultural biomass in energy
production is not the technology, but the business. For the farmer, the
compensation they get from selling straw for energy is very small compared to
the amount of resources and time required for harvesting and storing the product.
Furthermore, the additional income gained from this needs to be used to
cover the additional need for fertilizers, as straw is no longer used to improve
the soil.
Thus, we can relatively easily develop the technological means for including
agricultural biomasses in the bioenergy ecosystem, but at least in the case of
Finland, it is difficult to make it into a profitable business that attracts farmers.
References

1. Alakangas, E., Virkkunen, M.: Biomass fuel supply chains for solid biofuels.
Publication produced in the EUBIONET2 project (EIE/04/065/S07.38628) (2007)
2. Burvall, J.: Influence of harvest time and soil type on fuel quality in reed canary
grass (Phalaris arundinacea L.). Biomass Bioenergy 12(3), 149–154 (1997)
3. Ginet, C.: Standardised electronic data exchange in the French wood supply chain.
In: The Proceedings of the Precision Forestry Symposium 2014: The Anchor of
Your Value Chain (2014)
4. Laitinen, T., Lötjönen, T.: Energy from field energy crops - a handbook for energy
producers. Technical report, Intelligent Energy Europe (2009)
5. Lampathaki, F., Mouzakitis, S., Gionis, G., Charalabidis, Y., Askounis, D.: Business to business interoperability: a current review of XML data integration standards. Comput. Stand. Interfaces 31(6), 1045–1055 (2009)
6. Lindh, T., Paappanen, T., Kallio, E., Flyktman, M., Kyhk, V., Selin, P., Huotari,
J.: Production of reed canary grass and straw as blended fuel in Finland. Technical
report, VTT Technical Research Centre of Finland Ltd (2005)
7. Lord, R.: Reed canarygrass (Phalaris arundinacea) outperforms miscanthus or
willow on marginal soils, brownfield and non-agricultural sites for local, sustainable
energy crop production. Biomass Bioenergy 78, 110–125 (2015)
8. Lötjönen, T., Joutsjoki, V.: Harvest and storage of moist cereal straw. Best research
report no 2.1.5. Technical report, Cleen Ltd (2015)
9. Martinez, M., Fouletier, P., Park, K., Favrel, J.: Virtual enterprise organisation,
evolution and control. Int. J. Prod. Econ. 74(1–3), 225–238 (2001). Productive
Systems: Strategy, Control, and Management
10. McKendry, P.: Energy production from biomass (part 1): overview of biomass.
Bioresour. Technol. 83(1), 37–46 (2002). Reviews Issue
11. Mtibaa, F., Chaabane, A., Abdellatif, I., Li, Y.: Towards a traceability solution
in the Canadian forest sector. In: 1st International Physical Internet Conference
12. Naik, S., Goud, V.V., Rout, P.K., Dalai, A.K.: Production of first and second
generation biofuels: a comprehensive review. Renew. Sustain. Energy Rev. 14(2),
578–597 (2010)
13. Naslund, D., Williamson, S.A.: Supply chain integration: barriers and driving forces
in an action research-based industry intervention. In: Supply Chain Forum: An
International Journal, vol. 9, pp. 70–80. Taylor & Francis (2008)
14. Nelson, M.L., Shaw, M.J., Qualls, W.: Interorganizational system standards development in vertical industries. Electron. Markets 15(4), 378–392 (2005)
15. Nurmilaakso, J.-M., Kotinurmi, P., Laesvuori, H.: XML-based e-business frameworks and standardization. Comput. Stand. Interfaces 28(5), 585–599 (2006)
16. Panoutsou, C., Elbersen, B., Böttcher, H.: Energy crops in the European context.
Technical report, Biomass Futures (2011)
17. papiNet: The papiNet standard V2R40. Technical report, The papiNet Initiative
18. Sami, M., Annamalai, K., Wooldridge, M.: Co-firing of coal and biomass fuel
blends. Prog. Energy Combust. Sci. 27(2), 171–214 (2001)
19. Satafood: Biotaloudella lisäarvoa maataloustuotannolle (additional value to agricultural production through bioeconomy). Technical report, Satafood Oy (2014)
20. Skogforsk: Introduction to StanForD 2010. Technical report, Skogforsk (2010)
21. F. F. Statistics. http://www.maataloustilastot.fi/en/uusi-etusivu. Accessed 7 July
A Case-Base Approach to Workforces’
Satisfaction Assessment
Ana Fernandes1, Henrique Vicente2,3, Margarida Figueiredo2,4,
Nuno Maia5, Goreti Marreiros6, Mariana Neves7, and José Neves3(✉)
1 Organização Multinacional de Formação, Lisbon, Portugal
[email protected]
2 Departamento de Química, Escola de Ciências e Tecnologia,
Universidade de Évora, Évora, Portugal
3 Centro Algoritmi, Universidade do Minho, Braga, Portugal
[email protected]
4 Centro de Investigação em Educação e Psicologia,
Universidade de Évora, Évora, Portugal
5 Departamento de Informática, Universidade do Minho, Braga, Portugal
[email protected]
6 Departamento de Engenharia Informática,
GECAD – Grupo de Engenharia do Conhecimento e Apoio à Decisão,
Instituto Superior de Engenharia do Porto, Porto, Portugal
[email protected]
7 Deloitte, London, UK
[email protected]
Abstract. It is well known that human resources play a valuable role in
sustainable organizational development. Indeed, this work focuses on the
development of a decision support system to assess workers’ satisfaction based
on factors related to human resources management practices. The framework is
built on top of a Logic Programming approach to Knowledge Representation
and Reasoning, complemented with a Case Based approach to computing. The
proposed solution is unique in itself, since it caters for the explicit treatment of
incomplete, unknown, or even self-contradictory information, in either a
qualitative or a quantitative setting. Furthermore, clustering methods based on
similarity analysis among cases were used to distinguish and aggregate collections
of historical data or knowledge in order to reduce the search space,
therefore enhancing case retrieval and the overall computational process.
Keywords: Human resources management · Logic programming · Case-based
reasoning · Knowledge representation and reasoning · Decision support systems
1 Introduction
In a global and competitive world every organization is in a constant state of worry
and urgency, and to survive it may need to adapt to new economic, organizational and
technological scenarios. Undeniably, organizations should create innovative
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 191–206, 2016.
DOI: 10.1007/978-3-319-49944-4_15
A. Fernandes et al.
strategies to promote their own competitive advantages. A company’s staff is a key
asset and plays an important role in achieving its objectives [1, 2]. Indeed, a
company’s productivity is tightly related to its people and their strategies or, in other
words, workers’ satisfaction stands as a significant instrument in human resources
management, leading to [1, 2]:
Enhanced quality of the offered products and services;
Positive attitude towards the company;
Better observance of deadlines;
Low personnel fluctuation;
Low absenteeism rates; and
Creativity and the assumption of responsibilities.
The management of workers’ satisfaction encompasses factors like training (aiming
at the development of the workers’ skills) and the creation of work environments that
encourage productivity, commitment and motivation. In this way, organizations
reveal major concerns in promoting practices which support their
employees, seeking a balance between professional and private lives [3, 4].
The present study addresses the theme of Human Resources Management, in
particular with regard to Workers’ Satisfaction. The assessment of workers’
satisfaction is a complex phenomenon that involves a large number of factors, some of
which depend on the workers themselves, and others on the organisation [3–5]. Because of
this, it is difficult to assess Workers’ Satisfaction, since it is necessary to consider
different conditions with complex relations among them, where the available data may
be incomplete/unknown (e.g., absence of answers to some questions presented in the
questionnaire), or even contradictory (e.g., questions relating to the same issue with
incongruous answers). In order to overcome these difficulties, the present work reports
on an uncommon approach to Knowledge Representation and Reasoning [6],
complemented with a Case Based attitude to computing [7, 8].
Undeniably, Case Based (CB) reasoning provides the ability to solve new problems by
reusing knowledge acquired from past experiences [7, 8], i.e., CB is used especially
when similar cases have similar terms and solutions, even when they have different
backgrounds [9]. Its use may be found in many different arenas, namely Online
Dispute Resolution [9] or Medicine [10, 11].
This paper comprises five sections. In the first one a brief introduction to the
problem is made. Then the proposed approach to Knowledge Representation and a CB
view of computing are introduced. In the fourth and fifth sections a case
study is assumed and an answer to the problem is presented. Finally, in the last section the most
relevant conclusions are described and possible directions for future work are outlined.
2 Background
Many approaches to Knowledge Representation and Reasoning have been proposed
using the Logic Programming (LP) epitome, namely in the area of Model Theory
[12, 13] and Proof Theory [6, 14]. In the present work the Proof Theoretical approach
in terms of an extension to the LP language is followed. An Extended Logic Program is
a finite set of clauses, given in the form:

¬p ← not p, not exception_p
p ← p1, …, pn, not q1, …, not qm
?(p1, …, pn, not q1, …, not qm) (n, m ≥ 0)
:: scoring_value

where the first clause stands for the predicate’s closure, “,” denotes “logical and”, and “?”
is a domain atom denoting falsity; the pi, qj, and p are classical ground literals, i.e.,
either positive atoms or atoms preceded by the classical negation sign ¬ [6]. Indeed, ¬
stands for a strong declaration that speaks for itself, while not denotes negation-by-failure,
i.e., a failure in proving a given statement, once it was not declared
explicitly. Under this formalism, every program is associated with a set of abducibles
[12, 13], given here in the form of exceptions to the extensions of the predicates that
make the program, i.e., clauses of the form:

exception_p1, …, exception_pj (0 ≤ j ≤ k), k being an integer number

that stand for data, information or knowledge that cannot be ruled out. On the other
hand, clauses of the type:

?(p1, …, pn, not q1, …, not qm) (n, m ≥ 0)

also named invariants, allow one to set the context under which the universe of
discourse has to be understood. The term scoring_value stands for the relative weight of
the extension of a specific predicate with respect to the extensions of its peers that
make the inclusive or global program.
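The difference between strong negation (¬) and negation-by-failure (not) can be illustrated with a minimal three-valued lookup. This is an informal Python sketch of the idea, not the authors' formalism; the facts are invented for the example:

```python
# Facts declared true, and facts declared (strongly) false; everything else
# is unknown, rather than false as under closed-world negation-as-failure.
known_true = {"satisfied(anna)"}
known_false = {"satisfied(bob)"}   # ¬satisfied(bob) declared explicitly

def truth_value(literal: str) -> str:
    if literal in known_true:
        return "true"
    if literal in known_false:
        return "false"     # strong negation: explicitly declared
    return "unknown"       # not provable, but never declared false

print(truth_value("satisfied(anna)"))   # true
print(truth_value("satisfied(bob)"))    # false
print(truth_value("satisfied(carl)"))   # unknown (NAF alone would say false)
```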
Knowledge Representation and Reasoning – Quantitative Knowledge
On the one hand, the Quality-of-Information (QoI) of a logic program will be
understood as a metric given by a truth-value ranging between 0 and 1 [15, 16].
Indeed, QoI_i = 1 when the information is known (positive) or false (negative), and
QoI_i = 0 if the information is unknown. For situations where the extensions of the
predicates that make the program also include abducible sets, their terms (or clauses)
present a QoI_i ∈ ]0, 1[, which will be given by:

QoI_i = 1/Card

if the abducible set for predicates i and j satisfies the invariant:

?((exception_pi ; exception_pj), ¬(exception_pi , exception_pj))

where “;” denotes “logical or” and “Card” stands for set cardinality, with i ≠ j and i,
j ≥ 1 (a pictorial view of this process is given in Fig. 1(a), as a pie chart).
On the other hand, the clauses’ cardinality (K) will be given by
K = C(Card, 1) + … + C(Card, Card)
if there is no constraint on the possible combinations among the abducible clauses,
the QoI being acknowledged as:

QoI_i (1 ≤ i ≤ Card) = 1/C(Card, 1), …, 1/C(Card, Card)

where C(Card, k) stands for a k-combination subset of the Card elements. A pictorial
view of this process is given in Fig. 1(b), as a pie chart.
However, a term’s QoI also depends on its attributes’ QoI. In order to evaluate
this metric, consider Fig. 2, where the segment with limits 0 and 1 stands for
each attribute’s domain, i.e., all the attributes range in the interval [0, 1]. [A, B] denotes
the scope where the unknown attribute values for a given predicate may occur
(Fig. 2). Therefore, the QoI of each attribute’s clause is calculated using:
Fig. 1. QoI’s values for the abducible set for predicatei with (a) and without (b) constraints on
the possible combinations among the abducible clauses.
Fig. 2. Setting the QoIs of each attribute’s clause.
QoI_attribute_i = 1 − ||A − B||

where ||A − B|| stands for the modulus of the arithmetic difference between A and B.
Thus, Fig. 3 shows the QoI values for the abducible set for predicate_i.
Under this setting, a new metric has to be considered, denoted as
DoC (Degree-of-Confidence), which stands for one’s confidence that the argument values
or attributes of the terms that make the extension of a given predicate, taking into
consideration their domains, are in a given interval [17]. The DoC is computed using
DoC = √(1 − Δl²), where Δl stands for the argument interval length, which was set to
the interval [0, 1] (Fig. 4).
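Assuming attribute values normalized to the interval [0, 1], both metrics reduce to one-line computations, as the following sketch shows:

```python
from math import sqrt

def qoi_attribute(a: float, b: float) -> float:
    """QoI of an attribute whose unknown value lies in [a, b] within [0, 1]."""
    return 1.0 - abs(a - b)

def doc(a: float, b: float) -> float:
    """Degree-of-Confidence: DoC = sqrt(1 - dl^2), dl the interval length."""
    dl = abs(a - b)
    return sqrt(1.0 - dl * dl)

print(qoi_attribute(0.3, 0.3))   # 1.0 -- value known exactly
print(doc(0.0, 1.0))             # 0.0 -- value completely unknown
print(round(doc(0.2, 0.5), 3))   # 0.954
```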
Thus, the universe of discourse is engendered according to the information presented in
the extensions of such predicates, according to productions of the type:

predicate_i − ⋃(1 ≤ j ≤ m) clause_j((QoI_x1, DoC_x1), …, (QoI_xl, DoC_xl)) :: QoI_j :: DoC_j
Fig. 3. QoI’s values for the abducible set for predicatei with (a) and without (b) constraints on
the possible combinations among the abducible clauses.
QoI_j = Σ(QoI_i × p_i)/n denotes the QoI average of the attributes of each clause (or
term) that sets the extension of the predicate under analysis. n and p_i stand,
respectively, for the attributes’ cardinality and the relative weight of attribute p_i with
respect to its peers (Σ p_i = 1).
Fig. 4. Evaluation of the degree of confidence.
where ⋃, m and l stand, respectively, for set union, the cardinality of the extension of
predicate_i, and the number of attributes of each clause [17]. The subscripts of the QoIs
and DoCs, x1, …, xl, stand for the attribute value ranges.
Knowledge Representation and Reasoning – Qualitative Knowledge
In the present study both qualitative and quantitative data/knowledge are present. Aiming
at the quantification of the qualitative part, and in order to make the process easy to
understand, it was decided to put it in graphical form. Taking as an example a set
of n issues regarding a particular subject (where the possible alternatives are none, low,
moderate, high and very high), a unitary area circle split into n slices is itemized
(Fig. 5). The marks on the axis correspond to each of the possible choices. If the answer
to issue 1 is high, the corresponding area is 0.75/n (Fig. 5(a)).
Assuming that in issue 2 both high and very high are chosen, the
corresponding area ranges in the interval [0.75/n, 1/n] (Fig. 5(b)). Finally, if no
alternative is ticked in issue n, all the hypotheses should be considered and the area
varies in the interval [0, 1/n] (Fig. 5(c)). The total area is the sum of the partial ones
(Fig. 5(d)).
Fig. 5. A view of the qualitative data/information/knowledge processing.
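A minimal sketch of this quantification, assuming the five alternatives map to the values 0, 0.25, 0.5, 0.75 and 1 as in the examples above:

```python
# Map each ticked alternative of an issue to a share of the unit circle
# split into n slices; unanswered issues span the whole slice [0, 1/n].
LEVELS = {"none": 0.0, "low": 0.25, "moderate": 0.5,
          "high": 0.75, "very high": 1.0}

def issue_area(ticked: list[str], n: int) -> tuple[float, float]:
    """Return the [min, max] area contributed by one of the n issues."""
    if not ticked:                     # no alternative ticked
        return (0.0, 1.0 / n)
    values = sorted(LEVELS[t] for t in ticked)
    return (values[0] / n, values[-1] / n)

n = 4
print(issue_area(["high"], n))                # (0.1875, 0.1875)
print(issue_area(["high", "very high"], n))   # (0.1875, 0.25)
print(issue_area([], n))                      # (0.0, 0.25)
```

The total area for a questionnaire is then the sum of the per-issue intervals, matching the construction in Fig. 5(d).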
3 A Case Based Methodology for Problem Solving
The CB methodology for problem solving stands for the act of finding and justifying a
solution to a given problem based on the consideration of similar past ones, by
reprocessing and/or adapting their data/knowledge [7, 8]. In CB the cases are stored
in a Case Base, and those cases that are similar (or close) to a new one are used in the
problem solving process. The typical CB cycle presents the mechanism that should be
followed, where the first stage entails an initial description of the problem. The new
case is defined and used to retrieve one or more cases from the Case Base.
Despite promising results, current CB systems are neither complete nor
adaptable enough for all domains. In some cases, the user cannot choose the similarity
method(s) and is required to follow the one(s) defined by the system, even if they do not
meet their needs. Moreover, in real problems, access to all necessary information is
not always possible, since existing CB systems have limitations related to the capability
of dealing, explicitly, with unknown, incomplete, and even self-contradictory
information. To make a change, a different CB cycle was induced (Fig. 6). It takes into
consideration the case’s QoI and DoC metrics. It also contemplates an optimization
process for the cases present in the Case Base, whenever they do not comply with the
terms under which a given problem has to be addressed (e.g., the expected DoC on a
prediction was not attained). This process, which uses either Artificial Neural Networks
[18, 19], Particle Swarm Optimization [20] or Genetic Algorithms [14], just to name a
few, generates a set of new cases which must be in conformity with the invariant:

⋂(1 ≤ i ≤ n) (B_i, E_i) ≠ ∅

i.e., it denotes that the intersection of the attribute value ranges of the cases’ set that
makes the Case Base or their optimized counterparts (B_i) (n being its cardinality), and
the ones that were the object of a process of optimization (E_i), cannot be empty (Fig. 6).
Fig. 6. The updated view of the CB cycle.
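The invariant can be checked directly when each attribute is represented as an interval [lo, hi]; the sketch below, with invented example intervals, tests that every case-base range B_i overlaps its optimized counterpart E_i:

```python
def intervals_intersect(b: tuple, e: tuple) -> bool:
    """True if intervals b = [lo, hi] and e = [lo, hi] share at least one point."""
    return max(b[0], e[0]) <= min(b[1], e[1])

def invariant_holds(B: list, E: list) -> bool:
    """Intersection of (B_i, E_i) non-empty for every attribute i."""
    return all(intervals_intersect(b, e) for b, e in zip(B, E))

B = [(0.2, 0.6), (0.0, 1.0)]   # attribute ranges from the Case Base
E = [(0.5, 0.9), (0.3, 0.4)]   # ranges produced by the optimization step
print(invariant_holds(B, E))                         # True
print(invariant_holds([(0.0, 0.2)], [(0.5, 0.7)]))   # False
```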
4 Methods
Aiming to develop a predictive model to assess workers’ satisfaction, a questionnaire
was designed specifically for this study and applied to a cohort of 236 employees of
training companies. This section describes briefly the data collection tool and how the
information is processed.
The questions included in the questionnaire aimed to evaluate the degree of workers’
satisfaction. The respondents participated in the study voluntarily and the
questionnaires were anonymous to ensure the confidentiality of the information
provided. The questions were organized into five sections, where the first one
includes general questions related to the workers’ age, gender, length of
service and functional area. The second one comprises questions related to the
workers’ opinions about the received training (Table Training Related Factors in
Fig. 7), while the third section is about occupational medicine service (Table Occupational Medicine Related Factors in Fig. 7). Finally, the fourth and fifth sections
comprise issues related with the workers’ opinions about the resources
(Table Resources Related Factors in Fig. 7) and organizational climate (Table Organizational Climate Related Factors in Fig. 7), respectively.
Workforces’ Satisfaction Knowledge Base
It is now possible to build up a knowledge base given in terms of the extensions of
the relations (or tables) depicted in Fig. 7, which stand for a situation where one has to
manage information aiming to estimate the workers’ satisfaction. Thus, the General
Information, Training, Occupational Medicine, Resources, and Organizational Climate Related Factors tables are populated with the responses to the issues presented in
the questionnaire, where some incomplete, default and/or unknown data is present. For
instance, in the first case the Functional Area is unknown (depicted by the symbol
⊥), while the opinion about the Applicability of the Training Received in the Daily
Work is not conclusive (High/Moderate).
The Length of Service column of the Satisfaction table is populated with 0 (zero), 1 (one), 2 (two), or 3 (three), which stand, respectively, for a length of service of less than one year, in the range [1, 3[, in the range [3, 5[, and of more than 5 years. The Functional Area column, in turn, is filled with 0 (zero), 1 (one), 2 (two), 3 (three), or 4 (four), which denote human resources, quality, marketing, financial, and commercial issues, respectively. In the Gender column, 0 (zero) and 1 (one) stand, respectively, for female and male.
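These column encodings can be captured as plain lookup rules. The sketch below is illustrative only; the raw answer labels and the handling of a length of service of exactly 5 years are assumptions, since the text only states the coded ranges:

```python
def encode_length_of_service(years):
    """0: < 1 year, 1: [1, 3[, 2: [3, 5[, 3: more than 5 years
    (a value of exactly 5 is mapped to 3 here by assumption)."""
    if years < 1:
        return 0
    if years < 3:
        return 1
    if years < 5:
        return 2
    return 3

# Functional Area and Gender codes, as stated in the text
FUNCTIONAL_AREA = {"human resources": 0, "quality": 1, "marketing": 2,
                   "financial": 3, "commercial": 4}
GENDER = {"female": 0, "male": 1}
```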
In order to quantify the information present in the Training, Occupational Medicine, Resources, and Organizational Climate Related Factors tables, the procedures already described above were followed. Applying the algorithm presented in [17] to the fields of the tables or relations that make up the knowledge base for workers' satisfaction assessment (Fig. 7), and looking at the DoC values obtained as described before, it is
A Case-Base Approach to Workforces’ Satisfaction Assessment
possible to set the arguments of the predicate satisfaction (satis) referred to below, whose extensions denote the objective function regarding the problem under analysis:
satis: Age, Gender, Length of Service, Functional Area, Training Related Factors, Occupational Medicine Related Factors, Resources Related Factors, Organizational Climate Related Factors → {0, 1}
where 0 (zero) and 1 (one) denote, respectively, the truth values false and true.
The algorithm presented in [17] encompasses different phases. In the first one, the clauses or terms that make up the extension of the predicate under study are established. In the
Fig. 7. A fragment of the knowledge base for workers’ satisfaction evaluation.
subsequent stage the arguments of each clause are set as continuous intervals. In a third step, the boundaries of the attribute intervals are mapped into the interval [0, 1] according to the normalization expression (Y − Ymin)/(Ymax − Ymin), where Y stands for the attribute value and Ymin, Ymax for its bounds. Finally, the DoC is evaluated as described in Sect. 2.1.
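The normalization step can be sketched as follows (a minimal illustration, not the authors' implementation; the Age bounds of 18 and 65 used in the example are assumptions):

```python
def normalize(y, y_min, y_max):
    """Min-max normalization (Y - Ymin) / (Ymax - Ymin) into [0, 1]."""
    return (y - y_min) / (y_max - y_min)

def normalize_interval(low, high, y_min, y_max):
    """Interval-valued attributes: normalize both boundaries."""
    return (normalize(low, y_min, y_max), normalize(high, y_min, y_max))

# e.g., Age = 37 under assumed bounds [18, 65]:
age_norm = normalize(37, 18, 65)  # ≈ 0.40
```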
Exemplifying with a term (worker) that presents the feature vector (Age = 37, Gender = 0, Length of Service = ⊥, Functional Area = 2, Occupational Medicine Related Factors = [0.45, 0.55], Training Related Factors = [0.65, 0.8], Resources Related Factors = 0.67, Organizational Climate Related Factors = [0.58, 0.75]), one may compute the corresponding normalized term and its DoC as described above.
5 A Case Based Approach to Computing
The framework presented previously shows how the information comes together. In this section, a soft computing approach is set to model the universe of discourse, where the computational part is based on a CB approach to computing. In contrast with other problem-solving tools (e.g., those that use Decision Trees or Artificial Neural Networks), relatively little work is done offline [9]. Indeed, in almost all situations, the work is performed at query time. The main difference between this new approach and the typical CB one lies in the fact that not only do all the cases have their arguments set in the interval [0, 1], but it also caters for the handling of incomplete, unknown, or even self-contradictory data or knowledge. Thus, the classic CB cycle was changed (Fig. 6), with the Case Base given in terms of the following pattern:
Case = {⟨Raw data, Normalized data⟩}
When confronted with a new case, the system is able to retrieve all cases that meet such a structure and to optimize such a population, taking into consideration that the cases retrieved from the Case Base must satisfy the invariant present in Eq. (5), in order to ensure that the intersection of the attribute ranges in the cases that make up the Case Base repository, or their optimized counterparts, and their equals in the new case cannot be empty. Having this in mind, the algorithm described above is applied to a new case, which in this study presents the feature vector (Age = ⊥, Gender = 1, Length of Service = 2, Functional Area = 1, Occupational Medicine Related Factors = 0.7, Training Related Factors = [0.5, 0.7], Resources Related Factors = [0.67, 0.75], Organizational Climate Related Factors = 0.75).
Then, the computational process may be continued, with the outcome:
new_case((1, 0), (1, 1), (1, 1), (1, 1), (1, 1), (1, 0.98), (1, 0.99), (1, 1)) :: 1 :: 0.87
Now, the new case may be portrayed on the Cartesian plane in terms of its QoI and DoC, and by using clustering methods [21] it is feasible to identify the cluster(s) that intermingle with the new one (depicted as a square in Fig. 8). The new case is compared with every case retrieved from the cluster using a similarity function sim, given in terms of the average of the modulus of the arithmetic difference between the arguments of each case of the selected cluster and those of the new case. Thus, one may have:
retrieved_case_1((1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 0.95)) :: 1 :: 0.99
retrieved_case_2((1, 1), (1, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 0.85), (1, 1)) :: 1 :: 0.73
…
retrieved_case_j((1, 1), (1, 1), (1, 0), (1, 0), (1, 0), (1, 1), (1, 1), (1, 0.97)) :: 1 :: 0.62
(the normalized cases that make up the retrieved cluster)
Fig. 8. A case’s set divided into clusters.
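The cluster selection on the (QoI, DoC) plane can be sketched as a nearest-centroid assignment. This is an illustrative sketch, not the authors' implementation; the centroid values are assumptions, and only the new case's coordinates (QoI = 1, DoC = 0.87) come from the text:

```python
def nearest_cluster(point, centroids):
    """Index of the centroid closest (squared Euclidean distance)
    to a (QoI, DoC) point; that cluster's cases are then retrieved."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

# hypothetical cluster centroids in the (QoI, DoC) plane
centroids = [(1.0, 0.95), (1.0, 0.70), (0.8, 0.50)]
cluster = nearest_cluster((1.0, 0.87), centroids)  # selects the cluster to retrieve from
```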
Assuming that every attribute has equal weight, for the sake of presentation, the dis(similarity) between new_case and retrieved_case_1, i.e., dis_new_case→1, may be computed as:

dis^DoC_new_case→1 = (‖0 − 1‖ + ‖1 − 1‖ + … + ‖0.98 − 1‖ + ‖0.99 − 1‖ + ‖1 − 0.95‖) / 8 = 0.14

Thus, the sim(ilarity) sim^DoC_new_case→1 is set as 1 − 0.14 = 0.86. Regarding QoI the procedure is similar, returning sim^QoI_new_case→1 = 1. Thus, one may have:

sim^QoI,DoC_new_case→1 = 1 × 0.86 = 0.86
a value that may now be read and interpreted by the experts. These procedures should be applied to the remaining cases of the retrieved cluster(s) in order to obtain the most similar ones, which may stand for the possible solutions to the problem. However, one problem remains: when facing multiple experts and multiple assessment methods, how should they be integrated?
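The dis/sim computation above can be reproduced directly from the DoC values of new_case and retrieved_case_1 (a minimal sketch of the stated averaging procedure):

```python
def dissimilarity(a, b):
    """Average of the absolute differences between the argument
    vectors of two cases (here, their DoC values)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

new_case_doc  = [0, 1, 1, 1, 1, 0.98, 0.99, 1]
retrieved_doc = [1, 1, 1, 1, 1, 1, 1, 0.95]

dis = dissimilarity(new_case_doc, retrieved_doc)  # 1.08 / 8 = 0.135, i.e., 0.14 rounded
sim = 1 - dis                                     # 0.865, i.e., 0.86 rounded
```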
In order to answer this question, let us consider that one has p (p ≥ 2) experts and the pairings e_i/domain_i, where e_i stands for expert_i, and domain_i denotes the metrics or methods used by expert_i to read the study outcome, i.e., the pair (QoI_i, DoC_i), here given in terms of sim^QoI,DoC_new_case→j, where j stands for case_j in the cluster(s) of retrieved cases. domain_i may be, for example, a set (e.g., {low, moderate, high, very high}), an interval (e.g., [80, 90]), a number (e.g., 80), or an unknown value (e.g., ?).
A pictorial view of the process is given by Fig. 9 (that sets the relation
experts/readings), and Fig. 10 (that sets the overall assessment).
Fig. 9. The relation experts/readings.
Fig. 10. The overall assessment that is given by the areas’ sum.
In order to evaluate the performance of the proposed model, the dataset was divided into mutually exclusive subsets through ten-fold cross-validation [19]. In the implementation of the respective dividing procedures, ten executions were performed for each one of them. Table 1 presents the coincidence matrix of the CB model, where the values presented denote the average of 20 (twenty) experiments. A perusal of Table 1 shows that the model accuracy was 92.4 % (i.e., 218 instances correctly classified out of 236). Thus, the predictions made by the CB model are satisfactory, attaining accuracies higher than 90 %. The sensitivity and specificity of the model were 95.0 % and 87.0 %, while the Positive and Negative Predictive Values were 93.8 % and 89.3 %, respectively. The ROC curve is shown in Fig. 11. The area under the ROC curve (0.91) denotes that the model exhibits a good performance in the assessment of workers' satisfaction.
Table 1. The coincidence matrix for the CB model (rows: true class; columns: predicted class).

            True (1)   False (0)
True (1)    151        8
False (0)   10         67
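The performance figures quoted in the text follow directly from the coincidence matrix; a minimal check (the cell values 8 and 67, not visible in the extracted table, are inferred from the reported totals and metrics):

```python
tp, fn = 151, 8   # truly satisfied workers: correctly / incorrectly classified
fp, tn = 10, 67   # truly unsatisfied workers: incorrectly / correctly classified

accuracy    = (tp + tn) / (tp + fn + fp + tn)  # 218/236 ≈ 92.4 %
sensitivity = tp / (tp + fn)                   # ≈ 95.0 %
specificity = tn / (tn + fp)                   # ≈ 87.0 %
ppv         = tp / (tp + fp)                   # ≈ 93.8 %
npv         = tn / (tn + fn)                   # ≈ 89.3 %
```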
Fig. 11. The ROC curve regarding the proposed model.
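The ten-fold splitting used in the evaluation can be sketched as follows (an illustrative partitioning; the authors cite [19] rather than a specific implementation, so the shuffling and seeding below are assumptions):

```python
import random

def ten_fold_split(n, seed=0):
    """Partition indices 0..n-1 into 10 mutually exclusive folds;
    each fold serves once as the test set, the rest as training data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::10] for k in range(10)]

folds = ten_fold_split(236)  # the cohort size used in this study
```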
6 Conclusion
The assessment of workers' satisfaction is not only an invaluable practice, but something of the utmost importance in the context of organizational efficiency. To meet this challenge, organizations need to optimize their efficiency in order to achieve excellence in their practices. However, it is difficult to assess workers' satisfaction, since it is necessary to consider different variables and/or conditions, with complex relations entwined among them, where the data may be incomplete, contradictory, or even unknown. In order to overcome these difficulties, this work presents a Decision Support System to estimate workers' satisfaction. The methodology followed was centred on a formal framework based on LP for knowledge representation and reasoning, complemented with a
CB approach to computing. It may set the basis for an overall approach to such systems in this arena. Indeed, it also has the potential to be disseminated across other prospective areas, thereby validating a universal attitude. Under this line of thinking, the cases' retrieval and optimization phases were enhanced when compared with existing tactics or methods. Additionally, under this approach users may define the cases' attribute weights on the fly, letting them choose the appropriate strategies to address the problem (i.e., it gives the user the possibility of narrowing the search space for similar cases at runtime). Finally, a solution was presented to the question: when facing multiple experts and multiple assessment methods, how should they be integrated? This is quite important, for example, in group(s) construction and in the assessment of their outcomes.
Acknowledgments. This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope:

References
1. Elena, N.I.: Human resources motivation – an important factor in the development of
business performance. Ann. Fac. Econ. Univ. Oradea 1(1), 1039–1045 (2012)
2. An, N., Liu, J., Wang L., Bai, Y.: Employee satisfaction as an important tool in human
resources management. In: Proceedings of the 4th International Conference on Wireless
Communications, Networking and Mobile Computing, pp. 1–4. IEEE Edition (2008)
3. Alfes, K., Shantz, A.D., Truss, C., Soane, E.C.: The link between perceived human resource
management practices, engagement and employee behaviour: a moderated mediation model.
Int. J. Hum. Resour. Manag. 24, 330–351 (2013)
4. Berman, E.M., Bowman, J.S., West, J.P., Van Wart, M.R.: Human Resource Management in
Public Service: Paradoxes, Processes, and Problems. SAGE Publications Inc., California
5. Petrescu, A.I., Simmons, R.: Human resource management practices and workers’ job
satisfaction. Int. J. Manpower 29, 651–667 (2008)
6. Neves, J.: A logic interpreter to handle time and negation in logic databases. In: Muller, R.,
Pottmyer, J. (eds.) Proceedings of the 1984 Annual Conference of the ACM on the 5th
Generation Challenge, pp. 50–54. Association for Computing Machinery, New York (1984)
7. Aamodt, A., Plaza, E.: Case-based reasoning: foundational issues, methodological
variations, and system approaches. AI Commun. 7, 39–59 (1994)
8. Richter, M.M., Weber, R.O.: Case-Based Reasoning: A Textbook. Springer, Berlin (2013)
9. Carneiro, D., Novais, P., Andrade, F., Zeleznikow, J., Neves, J.: Using case-based reasoning
and principled negotiation to provide decision support for dispute resolution. Knowl. Inf.
Syst. 36, 789–826 (2013)
10. Guessoum, S., Laskri, M.T., Lieber, J.: RespiDiag: a case-based reasoning system for the
diagnosis of chronic obstructive pulmonary disease. Expert Syst. Appl. 41, 267–273 (2014)
11. Quintas, A., Vicente, H., Novais, P., Abelha, A., Santos, M.F., Machado, J., Neves, J.: A case
based approach to assess waiting time prediction at an intensive care unity. In: Arezes, P. (ed.)
Advances in Safety Management and Human Factors. Advances in Intelligent Systems and
Computing, vol. 491, pp. 29–39. Springer, Cham (2016)
12. Kakas, A., Kowalski, R., Toni, F.: The role of abduction in logic programming. In: Gabbay, D.,
Hogger, C., Robinson, I. (eds.) Handbook of Logic in Artificial Intelligence and Logic
Programming, vol. 5, pp. 235–324. Oxford University Press, Oxford (1998)
13. Pereira, L., Anh, H.: Evolution prospection. In: Nakamatsu, K. (ed.) New Advances in
Intelligent Decision Technologies – Results of the First KES International Symposium IDT
2009. Studies in Computational Intelligence, vol. 199, pp. 51–64. Springer, Berlin (2009)
14. Neves, J., Machado, J., Analide, C., Abelha, A., Brito, L.: The halt condition in genetic
programming. In: Neves, J., Santos, M.F., Machado, J.M. (eds.) EPIA 2007. LNCS (LNAI),
vol. 4874, pp. 160–169. Springer, Heidelberg (2007). doi:10.1007/978-3-540-77002-2_14
15. Lucas, P.: Quality checking of medical guidelines through logical abduction. In: Coenen, F.,
Preece, A., Mackintosh, A. (eds.) Research and Developments in Intelligent Systems XX,
Proceedings of AI-2003, pp. 309–321. Springer, London (2003)
16. Machado, J., Abelha, A., Novais, P., Neves, J., Neves, J.: Quality of service in healthcare
units. In: Bertelle, C., Ayesh, A. (eds.) Proceedings of the ESM 2008, pp. 291–298. Eurosis
– ETI Publication, Ghent (2008)
17. Fernandes, F., Vicente, H., Abelha, A., Machado, J., Novais, P., Neves, J.: Artificial neural
networks in diabetes control. In: Proceedings of the 2015 Science and Information
Conference (SAI 2015), pp. 362–370. IEEE Edition (2015)
18. Vicente, H., Dias, S., Fernandes, A., Abelha, A., Machado, J., Neves, J.: Prediction of the
quality of public water supply using artificial neural networks. J. Water Supply: Res.
Technol. –AQUA 61, 446–459 (2012)
19. Haykin, S.: Neural Networks and Learning Machines. Pearson Education, Upper Saddle
River (2009)
20. Mendes, R., Kennedy, J., Neves, J.: Watch thy neighbor or how the swarm can learn from its
environment. In: Proceedings of the 2003 IEEE Swarm Intelligence Symposium (SIS 2003),
pp. 88–94. IEEE Edition (2003)
21. Figueiredo, M., Esteves, L., Neves, J., Vicente, H.: A data mining approach to study the
impact of the methodology followed in chemistry lab classes on the weight attributed by the
students to the lab work on learning and motivation. Chem. Educ. Res. Pract. 17, 156–171
Effective Business Process Management
Centres of Excellence
Vuvu Nqampoyi, Lisa F. Seymour(&), and David Sanka Laar
Department of Information Systems,
University of Cape Town, Cape Town, South Africa
[email protected], [email protected],
[email protected]
Abstract. This paper explains and describes how Business Process Management (BPM) Centres of Excellence (CoEs) can be effective. Thematic analysis of data collected from two large South African financial services corporations with operational CoEs produced a model, based on the Integrated Team Effectiveness Model, showing the factors that influence the effectiveness of a CoE. The services the CoE provides, as well as the industry standards it chooses to align with, were found to have the largest impact. This research provides practical value by highlighting factors organisations can include in their planning and be mindful of when establishing or improving the services of a CoE. The BPM CoE Effectiveness Model presented in this work is a theoretical contribution to this field and an extension of the ITEM model previously used in healthcare.
Keywords: BPM · BPM governance · Centre of excellence
1 Introduction
Due to economic changes and pressures, organisations have become interested in how
to enhance their business processes in order to improve business performance [1].
Business Process Management (BPM) is used to address this challenge. BPM refers to
all efforts made by an organisation to analyse, define and continuously improve its
fundamental activities [1]. Through BPM, an organisation can gain as well as sustain a
competitive advantage [2].
In order to succeed in the implementation of BPM, six core elements or success
factors have been suggested [3]. One of these elements is BPM governance. BPM
governance allows the organisation to establish clear roles and responsibilities to ensure
accountability in the implementation of BPM. The governance mechanism must provide guidance on process design and decision making for all processes in the organisation [3]. BPM governance is essential to ensure that BPM is embedded throughout
an organisation [4].
Although organisations have invested significantly in BPM initiatives, there have
been challenges in the implementation of BPM. One of these challenges has been how
to ensure that the delivery and sustainability of implementing BPM initiatives are
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 207–222, 2016.
DOI: 10.1007/978-3-319-49944-4_16
V. Nqampoyi et al.
consistent throughout the organisation [5]. This challenge is one that can be solved by
establishing a BPM Centre of Excellence [5] which we will refer to as a CoE.
A CoE is defined as a central organisational team that is required to market and embed BPM initiatives, perpetuating the benefits of BPM throughout an organisation. A CoE has been identified as indispensable when attempting to implement BPM governance in an organisation [6]. It is also described as an important tool for the implementation of BPM within an organisation and as a critical success factor [7].
Yet, a global survey [8] found that while organisations had established CoEs, the majority of the centres were only beginning to exert some valuable influence. For example, in a case study of a CoE established in the South African IT services sector, the CoE had not added significant value [9]. Researchers have described the recommended setup of a CoE, including the required services [3, 5, 10]. BPM methods, tools and techniques have also been covered extensively in the academic as well as the practitioner literature. However, a gap exists in determining how to ensure that a BPM team provides the value that it is expected to deliver [3].
Therefore, the purpose of this research is to determine the factors that can contribute to the effectiveness of a CoE within an organisation and to provide organisations with relevant suggestions to help them realise the full benefits of a CoE. The following research questions are proposed:
• What are the characteristics of an effective CoE?
• What factors support the effectiveness of a CoE?
To answer these questions, a literature review on related work is presented, followed by the research methodology, and then the data analysis and findings.
2 Literature Review
The purpose of BPM governance is to ensure that the strategy of the organisation is
aligned with its business processes [6]. BPM governance ensures that transparent and
appropriate accountability is established with regard to the roles and responsibilities of
BPM [3] ensuring that initial improvement and design efforts are coordinated and that
business processes continue to operate optimally through continuous improvement
efforts [11]. BPM Governance is also concerned with regulatory compliance and
managing risk, increasing an organisation’s efficiency and accountability as well as
providing measurement for decision makers [12]. BPM governance in an organisation
cannot be easily separated from managing the people, the technology and the functions
that perform the business processes [6]. Therefore BPM governance is challenging to
implement but is crucial for the success of business processes [12, 13].
In order to ensure business processes are customer focused and efficient, business
process success does not depend only on improving the business processes, but also on
structural changes [11]. Thus the greatest challenge for organisations is to align the
existing functional structure of the organisation to a process governance structure [12].
This can be achieved using different kinds of mechanisms and the way in which the
organisation is structured affects the method of process governance [11]. One such
mechanism is the establishment of a CoE [6].
The Effectiveness of a CoE
A CoE is a central organisational team [5]. Teams can be multidimensional and consist
of processes and structures that depend on the composition of the team, the type of
work expected of it, the scope of work as well as the interactions of the team [14]. The
effectiveness of a team refers to the extent to which the team successfully meets its
objectives [15]. In order to measure effectiveness, high level measures such as
organisational effectiveness can be used. However, these measures may not take into
account the specific goals set by the team. Research has not provided a clear direction
on how effective teams are created and maintained [14]. The head of a BPM team is
tasked with ensuring that business performance of the organisation is improved by
establishing efficient processes [16]. However, at present there are also no clear suggestions on how increased performance is measured. The organisational context
directly influences the effectiveness of the team and also determines under which initial
conditions the team will be effective. The resources, incentives and rewards as well as
the policy and social context are also influences [14].
Organisational Context
Establishing a CoE requires a number of organisational changes, including the structure
of the organisation [7]. The structure is changed to align with the business environment
and the subsystems are aligned with the organisation’s strategy to ensure the organisation performs effectively [1]. The CoE is an independent department with responsibility spanning across the organisation [1]. If the CoE reports to the IT department a
misalignment with the strategic objectives of the organisation may be caused [7]. The
CoE should have all the expert knowledge in a centralised place, and provide and
disseminates information as well as manages requests from different areas of the
organisation [7].
BPM maturity models have been developed to assist organisations in achieving business process excellence [17]. BPM maturity models suggest that the level of BPM maturity of an organisation influences how its CoE is structured [18]. In organisations with a high BPM maturity level, the CoE is proactive [19]. However, some researchers recommend that organisations which operate in low-dynamic environments should only aim for low BPM maturity, as the costs to set up and maintain additional BPM capabilities may not be worth the effort. Achieving the highest level of maturity may not necessarily be suitable for the environment of every organisation [18].
Another effect of implementing BPM initiatives is a change in the culture of the organisation, in that the attitude and behaviour towards business process improvements change [2]. Culture comprises the invisible values with which an organisation identifies [20]. A process-oriented culture implies that the organisation values BPM
initiatives and promotes business process effectiveness [17]. Organisations that value
customer orientation ensure that the customer is considered the ultimate goal of
business processes [20]. To ensure their BPM success organisations strive to maintain
values like customer focus, empowerment and innovation [17]. This is in contrast to
organisations where business units function in silos, resulting in customer processes
that the organisation does not understand well [1]. Excellence as a value in the
organisation promotes the culture of a workforce dedicated to eliminating shortcomings
as well as maintaining optimum performance and quality in all business processes [20].
The organisation adapts and fosters a culture that is willing to change [3].
Services of a CoE
Task design depends on the task type (service) and features of the task provided by the
CoE. One of the services of the CoE is Strategic alignment which ensures that all
processes are aligned to the organisational strategy. A vision and methods of achieving
the vision are drawn up with milestones [16]. Strategic alignment also ensures that
business processes work together [19]. The CoE needs to be well equipped to provide
the strategy committee with performance information on organisational processes as
well as advise the committee on different strategic opportunities [21].
With the implementation of BPM, organisations begin to view themselves from the
value chain perspective and realise their performance is linked to the performance of
their core processes [21]. The CoE has the ability to provide a wide variety of process
measures and/or process analytics making them valuable [19]. Appropriate and cost
effective methods need to be selected for collecting and analysing process performance
data [21]. Process performance, as a service provided by the CoE, includes defining
how the business processes will be measured as well as measuring, reporting and
monitoring the process performance [5]. The CoE can then assist the organisation in
determining why certain business processes are not performing well and is therefore in
a position to suggest improvement solutions [19].
Process maturity assessment is another important service in which the CoE assesses
the BPM maturity of the organisation on an ongoing basis [19]. This helps in defining
accurate BPM vision, strategy and a roadmap [16]. Currently, there is a wide variety of
maturity models which differ in their supporting methodology, foundation in theory,
depth and designs, although they are all based on critical success factors and assist with
designing a BPM roadmap [19].
In order to succeed in monitoring, controlling and improving processes, an
organisation first needs to understand their processes. This can be achieved via the
process architecture service by defining the value chain and all the processes that are
linked to the organisation’s value chain and how they interact with each other, or it may
be more in-depth, including performance measures, process managers and links to the
organisational mission, vision and strategies [5, 16, 21]. It is advisable to have a
repository to maintain links between the process artefacts and allow for easy updates
[19, 21]. Some of these artefacts may be maintained by other units requiring a close
link with management and modification of the enterprise architecture [19].
Another important task is Process Improvements and Change. These focus on
designing the best version of a business process, taking into account compliance,
financial impact, risk assessments [19] and improvement opportunities [5]. The CoE
then manages all process change initiatives in the organisation [5, 19, 21]. This ensures
a smooth transition in the procedures and reporting lines in the organisation [19].
There is also the project support service which ensures that process thinking,
methodology and frameworks are applied at all times during the course of projects [19].
The CoE is also tasked with providing training and education service to all
employees, particularly process managers, equipping them to manage processes daily
[21]. This task is positively related to the demand and success of BPM [19]. The CoE is
the marketer of BPM within the organisation and provides information about policy
guidelines, procedures, benefits realised from implementations, and how far in the
BPM strategy roadmap the organisation has travelled [16].
Organisations are increasingly required to comply with process legislation depending on the industry and the country, Sarbanes-Oxley being one such piece of legislation for financial institutions [21]. Process Compliance involves ensuring that legislation is built into the process architecture and that process models are maintained and updated regularly [21]. The CoE needs to have an in-depth understanding of the legislation with which the organisation needs to comply, and to customise BPM techniques, methods and tools to take different standards into account [19].
CoE Roles
The work performed by the CoE has been described in great depth by researchers [5,
10, 19]. However, research also cautions that the labels given to these services may
differ from organisation to organisation and many of the services specified could also
be offered by other teams in the organisation [19]. Various roles in a CoE have been
identified and discussed by researchers, and include BPM Executive [12], Process
Owner, Process Architect, Process Analyst [4]; Head of BPM, Process Expert, Process
Coordinator, Process Modeller, Enterprise Architect and Change Management Advisor
[16]. However some roles are not adequately supported, and the literature is still full of
contradictions [2]. Further research should focus on the team composition in relation to
the processes and outcomes to assist in understanding what expertise is required in a
team and how it can be organised [14]. The team needs to be able to function well
together interpersonally as well as technically [14].
Team Processes and Traits
While some researchers confirm that a positive team influences the effectiveness of the team, only a few studies explain how to create the enabling conditions [14]. A framework for understanding the multi-dimensional relationships within health care teams, named the Integrated Team Effectiveness Model (ITEM), has been developed [14]. The groupings and relationships in ITEM are shown in Fig. 1. Yet a gap exists with regard to beneficial BPM team processes and traits, although it is agreed that the people available to support BPM affect the structure and processes of the CoE [7].
Fig. 1. ITEM categories adapted from [14] (Organisational context; Task Design: Task Type and Task Features; Team Composition; Team Effectiveness: Objective and Subjective Outcomes)
3 Research Methodology
The research employed a descriptive and explanatory approach to describe CoE effectiveness and explain the factors influencing it. A case study of two large South African banks with operational CoEs was performed. These banks operate across the continent, and the sector operates in a very competitive environment with a constant
requirement to improve business performance [1]. The sector is well developed as the
2015-2016 World Economic Forum Global Competitiveness Survey ranks South
Africa 12th globally in terms of Financial Market Development [22]. The researchers
applied judgement sampling, particularly the key informant sampling technique [23].
Table 1 lists the roles of the participants in each organisation. The codenames P1–P7 are used to identify the seven participants while retaining their anonymity. Semi-structured interviews [24] were recorded and later transcribed. The researchers made use of notes during the interviews to document points that might not be heard in the recording, for example, body language. Thematic analysis was used to identify themes [25]. The first iteration of coding yielded a total of 183 themes. The third round of coding reduced these to 35 basic themes, which were then categorised using ITEM [14].
Table 1. List of selected participants

Interviewees in Bank 1   Interviewees in Bank 2
Head of process CoE      Head of process CoE
Value chain lead         Process engineer 1
Process steward          Process engineer 2
Value chain lead
4 Results and Discussion
This section discusses the findings of this research supported by quotes from the
participants and from the documentation that was used. In the data analysis it emerged
that some themes were more prevalent than others. Figure 2 shows the resultant BPM
CoE effectiveness model. Included in the model are the basic themes with the number
of quotes or empirical observations for each theme across all data sources. The model
and themes will now be discussed.
Effective Business Process Management Centres of Excellence
Fig. 2. BPM centre of excellence effectiveness model
Team Effectiveness
The literature on team effectiveness has failed to clearly specify what teams are expected to be effective at and has not taken into account the specific goals that a team sets for itself [14]. However, in the findings of this research, both the objective and the subjective outcomes of the CoE are very clear.
Objective Outcomes. Regarding the objective outcomes, the most prevalent theme is Customer Experience: the better the customer experience, the more effective the CoE. Thus the CoE is required to engineer the best customer experience for the client:
“One of the key things we are always trying to drive is customer experience” (P1),
“We’re trying to get to a point where we deliver consistent customer experiences obviously
consistently great” (P2).
Another prevalent theme that emerged is financial impact: if the CoE is effective, this should translate into cost savings or revenue generation for the organisation, as the processes of the organisation become more efficient:
“People in corporate really want to hear that’s how you’re going to make money. And we were
able to link having a process Centre of Excellence to this money” (P1).
“So the performance is measured based on the targets that have been set which is primarily
cost reduction and what are the savings to the bank” (P2).
A large part of the work of the CoE is to ensure that there are efficient processes to support the business of the organisation. Thus, efficient processes are used as an objective outcome to measure the effectiveness of the CoE:
“By making it more efficient, by increasing the levels of automation, by giving our dealers and
our salespeople more capacity” (P5),
“We as a team will drive efficiencies across all the operations” (P7).
Enabling business was found to be a measure of effectiveness of the CoE. This
refers to enabling the business team to attract more customers with the same number of
staff. This objective works hand in hand with efficient processes.
“We are sort of the enablers to the business” (P5),
The last objective outcome for the effectiveness of a CoE is delivering value. A CoE cannot be seen as effective if it is not seen to be delivering value to the organisation:
“you’re kind of dead if you start thinking about capability without delivery” (P1),
“because in business what happens is once they see especially if they start to see delivery then
they get excited and then they give you more” (P2)
Subjective Outcomes. The CoE can also evaluate its performance based on subjective outcomes. Measuring the performance of internal processes is carried out and used as a subjective outcome of the CoE:
“So the simple measurement for process Centre of Excellence really it’s the effectiveness and
the efficiency of the processes that we design” (P4).
“Whether the process performance is improved so where we have dashboards we can say okay
productivity was at x and the baseline and now it’s at y” (P5).
Managing the value chain across the organisation is another subjective outcome
that emerged from the research. It is noted in the literature that with the implementation
of BPM, organisations begin to view themselves from the value chain and realise that
their performance is linked to the performance of their core processes [21]. The CoE
designs and manages the interactions and processes along the value chain.
“So our role is really to look at all the initiatives we are trying to drive across the value chain”
“thing is that we actually need them to use the stuff to actually help us to manage the whole
value chain (Training)” (P1).
Managing change that occurs as a result of the CoE is another theme that emerged
as a subjective outcome. Managing change can also refer to the culture change that may
be occurring in terms of the enhancements the CoE is trying to achieve.
“the change management that’s what we are actually driving as well” (P4).
“other departments in the organisation focus on running the bank initiative, the CoE takes care
of change in the bank” (P6).
“it’s the mind sets of people so we have to start with the DNA or the culture change” (P5).
“We also look at measuring people on the success of Lean for managers, so that’s the culture
change how many change managers actually identifying and implementing initiatives” (P5).
Task Types
It emerged that the task types of the CoE team depend greatly on the kind of services delivered. The literature states that the governance mechanism must provide guidance on process design and decision making for all processes in the organisation [3]. The services that emerged included process modelling, process design, process improvement and increasing the process maturity level of the organisation. Some of these services result from aligning to industry standards.
“we then design processes to say this is how you could provide the service that you want to
provide to the customers” (P4).
“so that’s basically part of our continuous improvement function” (P2).
“check if there are any as-is processes currently on the system if there’s no processes then get
those mapped” (P3).
“We said we would move from where we think we are Bank 1 level 1.5 to level 4 where you get
standardised processes consistent in 36 months” (P1).
The Task Type also depends on the Scope of Work. The importance and challenges of BPM governance increase with the scope of the business processes: the greater the number of functional units impacted by a business process, the greater the challenges [11].
“your big projects which are your transform projects…technically span more than 6 months”
The type of work was found to depend on how the work is broken down and varies
depending on the service.
“So where you are introducing incremental change you do a CI project” (P2).
Also, the Task Type is affected by the CoE outputs, in terms of the documents, such as business cases, that are produced in order to deliver those services.
“we will prepare documents, which are required with all the findings” (P7).
Lastly, Innovation has an influence on the type of work that the CoE produces. The
CoE personnel are required to be innovative in how they think of the processes of the
organisation and this informs the type of work that they produce.
“Will be competition within the bank so all the process improvements will be approved over
there and the best project will win” (P7).
“those guys are innovating across process lines not across product lines” (P1).
Task Features
Task features refer to the characteristics of the work being carried out and many studies
have overlooked how these affect team functioning [14]. In this research, a number of
factors have been discovered that contribute to the task features. One of these is
aligning to industry standards; the use of industry standards seems to ensure that
everyone on the team has a clear understanding of what needs to be done. Good
practice recommends defining process standards and linking business processes to
information technology [4]. There is great emphasis on the use of well-known
methodologies to perform the work of the CoE.
“we have pretty much had to align if you think about the generic standards” (P1),
“defined methodologies that we apply like lean and lean six sigma” (P5),
The use of tools within the CoE ensures that the work is guided and uniformly performed. Tools also help with aligning to industry standards:
“We’re currently using ARIS,” (P3),
“primarily the modelling systems are used so one would be System Architect” (P6),
“we had to get a whole lot on the tools side, get proper process mapping tools” (P1).
Another important characteristic of the work of the CoE is providing accessible
information. This refers to availability of information that is required for the work of
the CoE,
“a SharePoint kind of system where the guys share information” (P1),
“Intranet has got in terms of information on how to do some stuff so one could always go in
there and look at information” (P2),
Autonomy of roles is another feature of their tasks. It refers to the ability of each role within the CoE to make its own decisions and work independently of the team.
“Each individual works on his project, you can take your decision because you’re the right
person on the project” (P6),
“where we are right now the guys are quite comfortable to really make decisions without
consultation” (P4),
Team Composition
The Team Composition is one of the factors that influence the effectiveness of the CoE.
This refers to the capabilities the team has and the capabilities it requires. The literature refers to roles rather than capabilities within the CoE; however, the findings of this research suggest that it is the capabilities within the team, rather than the roles, that are required.
“We kind of sourced a lot of these guys from manufacturing companies” (P1),
“I am not a permanent employee of bank 2, so I am a contractor” (P6),
“it’s got different kinds of people with different kinds of experiences, different backgrounds”
“to have process thinking people, people with logical thinking, people who can really build
relationship by doing the right things and flexible people” (P4).
Team Structure
Team Structure emerged as one of the characteristics not accounted for in ITEM, although reference is made to the fact that a team is made up of structures. In a 2015 survey, clear responsibilities were among the top five ranked success factors [26]. This assertion is supported in this study in that the team structure influences whether everyone understands their role and responsibilities, and is an important influence on the effectiveness of the CoE. Team structure encompasses the way the team is structured, how people are allocated, and their reporting lines.
“We currently report to the head of enablement basically an operations executive that sits on
the management board” (P5)
“We have structured them by value chain horizontal rather than by products” (P1),
“People are allocated across different spaces, in different domains” (P7).
Team Processes
Team processes are positively associated with perceived team effectiveness [14]. The
most prevalent theme that emerged is that of training for the team, although in the
literature training is only suggested as a service that the CoE provides. Aligning to industry standards results in required training for the team on those standards.
“basically we trained everyone on Lean six sigma black belt” (P5)
Avenues for collaboration are another prevalent theme, where the team has frequent meetings and forums to discuss issues, get support and share what each member of the team is working on. Social media tools have created alternative ways of collaborating.
“We have a WhatsApp group where we collaborate…” (P2),
“we have a working session on a Thursday, which is like for 3 to four hours” (P3),
“through our weekly forums … they present their stuff” (P4).
Stakeholder involvement in the work of the CoE is an important team process that
impacted CoE effectiveness. This is supported by a 2015 survey identifying integration
of important stakeholders as one of the top 5 success factors [26].
“key part of our role is to make sure that we manage expectations” (P2),
“we work with the operations people… the heads of business units” (P5),
“We like to think of these internal people as really stakeholders you need to manage” (P1).
The style of management affects the way in which the team works together, how decisions are made within the team, and generally how issues are resolved.
“As and when there is something we need to talk about we just gather at each other’s desk and
then we pick up the conversation and then we deal with that” (P4),
“You can address them with your manager especially if its availability of stakeholders where
you would then have to escalate” (P3),
“we’ve got open door policy like all the way up to the Head” (P2).
Communication, project prioritisation as well as conflict resolution were also
themes that emerged from the data. Positive communication patterns and low conflict
levels are characteristics of effective teams [14]. The data revealed that there were
many channels for communication and that communication in the team can be done at
many levels.
Team Traits
Team traits are practices that have been internalised by the CoE and that have stabilised
over time. Team Traits that emerged included accountability, team culture and team
understanding. A clear understanding of the responsibilities of the CoE and what each
role in the centre is accountable for is required.
“The process steward basically runs the task so to speak and they hold the business relationship” (P2),
“process engineer for each of the departments is responsible for that whole process” (P5),
The team culture was referred to as process driven and delivery focused:
“So we are delivery focused” (P2),
“I mean we are more a process driven department and we try to really entrench the culture of
process thinking within the organisation” (P4).
The team understanding refers to unspoken rules and procedures that the team
knows. These form part of the team traits. The team had a tacit understanding of when
approval and formal methods were not needed.
“your value chain lead will give you the project so you don’t necessarily need approval in that
sense because they have already discussed it with business” (P3),
“depending on the complexity of each initiative we can follow the full methodology or in some
cases we just do a rapid assessment and you get to the solution you implement it” (P4).
Organisational Context
While organisational structure and culture affect the team’s outcomes, few studies have focused on context [14]. Firstly, factors within the organisation that drive the need for a CoE were classified as CoE drivers.
“the only way you get a predictable, consistent, standardised customer experience is if you
actually engineer the processes to deliver it” (P1),
“that team basically looks after the target states where you want to take the organisation” (P4).
The second factor was clear goals set by the organisation for the team and then
cascaded down to each individual within the team.
“Main target or goal for us is to get the bank self-driven in running these efficiencies” (P7),
“so we said we would move from where we think we are Bank level 1.5 to level 4 where you get
standardised processes consistent in 36 months. (Five stages of BPM Capability)” (P1).
Support from management at ExCo level is a reinforcement of the two factors
above, as it highlights the importance of achieving these goals. Top management
support was identified as the key enabler of success in BPM in a 2015 survey [26].
“I also form part of the strategy ExCo which is very important so that this is seen as very
strategic and you part of shaping the strategy” (P1),
“so the top management we’ll discuss the challenges we’ll discuss the objectives and approach
the management once we’ll get a buy in from them”(P7).
It is common for organisations to provide training to all members that work with
processes, including managers at the operational level, in order to equip them to
manage processes daily [21]. Although training for the organisation is only described
as a service that the CoE provides, it affects the effectiveness of the team. Training
creates awareness and provides a conducive environment for the CoE to operate in.
“what the Process Centre of Excellence does is to conscientise people around doing things
better” (P2),
“so the leadership programmes, there’s programs for middle managers, there’s programs for
newly appointed managers, you’ve got programmes for staff as well, that’s just internal programme” (P4),
“going forward we will be training each and every team or each and every individual within
the bank so that they can run their own efficiencies” (P7).
Resultant Model
The research identified that the effectiveness of the CoE is determined by a number of
interrelated factors. This is shown in Fig. 2. The Organisational Context lays the
foundation for the CoE and provides the reasons for its existence. It sets the goals that
need to be achieved by the CoE and ensures that the Centre has the necessary support
required from senior management in order for it to be effective. The Organisational
Context influences the task design as well as the team structure.
The Team Structure influences the work that is done in the team, such as the Task Design. This category is not in ITEM but describes the reporting line of the CoE and how resources are allocated.
The Task Design includes the type of work that the CoE performs, how the tasks differ from other tasks, what capabilities the CoE has and how the team works together. The services provided by the CoE and the alignment with industry standards are the dominant elements in task design. This research suggests that the capabilities of the team are more important than their roles, because different roles can serve the same capabilities. The Task Design in turn affects the team processes, traits and effectiveness.
The Team Processes are processes that are unique to the team. They directly affect
the effectiveness of the team. Collaboration, training and stakeholder involvement are
the most important processes of the team. The team processes also affect the team traits
and team effectiveness.
The Team Traits are unspoken rules in the team. The accountability, culture and
team understanding are all embedded in the team and are part of the makeup of the
team. The team traits affect the team processes and the perceived team effectiveness.
Team Effectiveness is measured and identified by how the CoE views itself and how the team is viewed by the organisation. The organisation has certain expectations of the CoE that, when met, convince it of the team’s effectiveness, the dominant factor being customer experience. The members of the CoE also have their own measures, such as measures of process performance, that ensure the team is working optimally and effectively.
Generalisability and Limitations. The model presented in this paper focuses on the effectiveness of the CoE team; it is limited in that it does not extend to other potential factors, such as business strategy and IT infrastructure, which would also affect the CoE team. In this study, factors were identified from seven participant interviews and theoretical saturation was not reached, so these themes are not necessarily complete. The sector the organisation operates in was the contextual factor that appeared to have the strongest impact. Given the maturity of the financial sector in South Africa, it seems reasonable that this model can be extended to other organisations in developed financial sectors.
5 Conclusion
This paper has investigated the factors that influence the effectiveness of a BPM CoE.
The results indicate that the effectiveness of a CoE depends on a number of interrelated
factors including the organisational context, team structure, task design, team processes, team traits, and challenges. Together, these factors produce a model which
explains factors influencing effectiveness of a CoE. For practitioners, this will be
particularly useful when establishing a CoE. From a theoretical perspective the ITEM
categories and relationships which were developed to explain the effectiveness of
health care teams were found to be valid for CoEs within the financial sector with all
the relevant themes presented. The model was extended to include the Structure category and all basic themes relevant for a CoE were derived from the case. This presents
a new theoretical contribution. The context of this study was the financial sector in
South Africa; further quantitative studies could validate this model and confirm whether these factors are applicable in other contexts. This research also highlighted the key role of process owners in BPM effectiveness, and further studies are recommended to explain the organisational models that enable stronger collaboration between process owners and CoEs.
References
1. Trkman, P.: The critical success factors of business process management. Int. J. Inf. Manage. 30(2), 125–134 (2010)
2. Niehaves, B., Poeppelbuss, J., Plattfaut, R., Becker, J.: BPM capability development–a
matter of contingencies. Bus. Process Manag. J. 20(1), 90–106 (2014)
3. Rosemann, M., vom Brocke, J.: The six core elements of business process management. In:
vom Brocke, J., Rosemann, M. (eds.) Handbook on Business Process Management 1,
pp. 105–122. Springer, Heidelberg (2015)
4. Doebeli, G., Fisher, R., Gapp, R., Sanzogni, L.: Using BPM governance to align systems and
practice. Bus. Process Manag. J. 17(2), 184–202 (2011)
5. Jesus, L., Macieira, A., Karrer, D., Rosemann, M.: A Framework for a BPM Center of
Excellence (2009). http://www.bptrends.com/publicationfiles/FOUR
6. Paim, R., Flexa, R.: Process Governance, Part II (2011). http://www.bptrends.com/processgovernance-part-ii/
7. Levina, O., Holschke, O.: Reusable decision models supporting organizational design in
business process management. In: BUSTECH 2011, the First International Conference on
Business Intelligence and Technology, pp. 45–50 (2011)
8. Harmon, P., Wolf, C.: Business Process Centers of Excellence Survey (2012). http://www.
9. Siriram, R.: A soft and hard systems approach to business process management. Syst. Res.
Behav. Sci. 29(1), 87–100 (2012)
10. Jesus, L., Macieira, A., Karrer, D., Caulliraux, H.: BPM center of excellence: the case of a
Brazilian company. In: vom Brocke, J., Rosemann, M. (eds.) Handbook on Business Process
Management 2, pp. 399–420. Springer, Heidelberg (2015)
11. Markus, M.L., Jacobson, D.D.: Business process governance. In: vom Brocke, J.,
Rosemann, M. (eds.) Handbook on Business Process Management 2, pp. 201–222.
Springer, Heidelberg (2010)
12. Jeston, J., Nelis, J.: Management by Process. Butterworth-Heinemann, Oxford (2008)
13. Spanyi, A.: Business process management governance. In: vom Brocke, J., Rosemann, M. (eds.) Handbook on Business Process Management 2, pp. 223–238. Springer, Heidelberg
14. Lemieux-Charles, L., McGuire, W.L.: What do we know about health care team
effectiveness? A review of the literature. Med. Care Res. Rev. 63(3), 263–300 (2006)
15. Eccles, M., Smith, J., Tanner, M., van Belle, J., van der Watt, S.: The impact of collocation
on the effectiveness of agile IS development teams. Commun. IBIMA 2010, 1–11 (2010)
16. Scheer, A.W., Brabänder, E.: The process of business process management. In: vom Brocke,
J., Rosemann, M. (eds.) Handbook on Business Process Management 2, pp. 239–265.
Springer, Heidelberg (2010)
17. Looy, A.V., Backer, M.D., Poels, G.: A conceptual framework and classification of
capability areas for business process maturity. Enterp. Inf. Syst. 8(2), 188–224 (2014)
18. Niehaves, B., Plattfaut, R., Becker, J.: Business process governance: a comparative study of
Germany and Japan. Bus. Process Manag. J. 18(2), 347–371 (2012)
19. Rosemann, M.: The Service Portfolio of a BPM Center of Excellence. In: vom Brocke, J.,
Rosemann, M. (eds.) Handbook on Business Process Management 2, pp. 381–398. Springer,
Heidelberg (2015)
20. vom Brocke, J., Sinnl, T.: Culture in business process management: a literature review. Bus.
Process Manag. J. 17(2), 357–378 (2011)
21. Harmon, P.: Business Process Change: A Business Process Management Guide for
Managers and Process Professionals, 3rd edn. Morgan Kaufmann, Waltham (2014)
22. The Global Competitiveness Report 2015–2016. http://reports.weforum.org/globalcompetitiveness-report-2015-2016/economies/#indexId=GCI&economy=ZAF
23. Marshall, M.N.: Sampling for qualitative research. Fam. Pract. 13(6), 522–525 (1996)
24. Myers, M.D., Newman, M.: The qualitative interview in is research: examining the craft. Inf.
Organ. 17(1), 2–26 (2007)
25. Fereday, J., Muir-Cochrane, E.: Demonstrating rigor using thematic analysis: a hybrid
approach of inductive and deductive coding and theme development. Int. J. Qual. Methods 5
(1), 80–92 (2006)
26. Höhne, M., Schnägelberger, S., Dussuyer, N., Vogel, J. et al.: Business Process Management
Study (2015). http://www.bearingpoint.com/en/adaptive-thinking/insights/business-processmanagement-study-2015
Business Intelligence and Big Data
Measuring the Success of Changes to Existing
Business Intelligence Solutions to Improve
Business Intelligence Reporting
Nedim Dedić(&) and Clare Stanier
Faculty of Computing, Engineering and Sciences,
Staffordshire University, College Road, Stoke-on-Trent ST4 2DE, UK
[email protected],
[email protected]
Abstract. To objectively evaluate the success of alterations to existing Business Intelligence (BI) environments, we need a way to compare measures from
altered and unaltered versions of applications. The focus of this paper is on
producing an evaluation tool which can be used to measure the success of
amendments or updates made to existing BI solutions to support improved BI
reporting. We define what we understand by success in this context, we elicit
appropriate clusters of measurements together with the factors to be used for
measuring success, and we develop an evaluation tool to be used by relevant
stakeholders to measure success. We validate the evaluation tool with relevant
domain experts and key users and make suggestions for future work.
Keywords: Business intelligence · Measuring success · User satisfaction · Technical functionality · Reports
1 Introduction
Improved decision-making, increased profit and market efficiency, and reduced costs
are some of the potential benefits of improving existing analytical applications, such as
Business Intelligence (BI), within an organisation. However, to measure the success of
changes to existing applications, it is necessary to evaluate the changes and compare
satisfaction measures for the original and the amended versions of that application. The
focus of this paper is on measuring the success of changes made to BI systems from a reporting perspective. The aims of this paper are: (i) to define what we understand by success in this context; (ii) to contribute to knowledge by defining criteria to be used for measuring the success of BI improvements that enable improved reporting; and (iii) to
develop an evaluation tool to be used by relevant stakeholders to measure success. The
paper is structured as follows: in Sect. 2 we discuss BI and BI reporting. Section 3
reviews measurement in BI, looking at end user satisfaction and technical functionality.
Section 4 discusses the development of the evaluation tool and Sect. 5 presents conclusions and recommendations for future work.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 225–236, 2016.
DOI: 10.1007/978-3-319-49944-4_17
N. Dedić and C. Stanier
2 Measuring Changes to BI Reporting Processes
Business Intelligence
BI is seen as providing competitive advantage [1–5] and essential for strategic
decision-making [6] and business analysis [7]. There are a range of definitions of BI,
some focus primarily on the goals of BI [8–10], others additionally discussing the
structures and processes of BI [3, 11–15], and others seeing BI more as an umbrella
term which should be understood to include all the elements that make up the BI
environment [16]. In this paper, we understand BI as a term which includes the
strategies, processes, applications, data, products, technologies and technical architectures used to support the collection, analysis, presentation and dissemination of
business information. The focus in this paper is on the reporting layer. In the BI
environment, data presentation and visualisation happen at the reporting layer through the use of BI reports, dashboards or queries. The reporting layer is one of the core
concepts underlying BI [14, 17–25]. It provides users with meaningful operational data
[26], which may be predefined queries in the form of standard reports or user defined
reports based on self-service BI [27]. There is constant management pressure to justify
the contribution of BI [28] and this leads in turn to a demand for data about the role and
uses of BI. As enterprises need fast and accurate assessment of market needs, and quick
decision making offers competitive advantage, reporting and analytical support
becomes critical for enterprises [29].
Measuring Success in BI
Many organisations struggle to define and measure BI success as there are numerous
critical factors to be considered, such as BI capability, data quality, integration with
other systems, flexibility, user access and risk management support [30]. In this paper,
we adopt an existing approach that proposes measuring success in BI as “[the] positive
benefits organisations could achieve by applying proposed modification in their BI
environment” [30], and adapt it to consider BI reporting changes to be successful only
if the changes provide or improve a positive experience for users.
DeLone and McLean proposed the well-known D&M IS Success Model to measure
Information Systems (IS) success [31]. The D&M model was based on a comprehensive
literature survey but was not empirically tested [32]. In their initial model, which was
later slightly amended [33, 34], DeLone and McLean wanted to synthesize previous
research on IS success into coherent clusters. The D&M model, which is widely
accepted, considers the dimensions of information quality, system quality, use, user
satisfaction, organisational and individual aspect as relevant to IS success. The most
current D&M model provides a list of IS success variable categories identifying some
examples of key measures to be used in each category [34]. For example: the variable
category system quality could use measurements such as ease of use, system flexibility, system reliability, ease of learning and response time; information quality
could use measurements such as relevance, intelligibility, accuracy, usability and
completeness; service quality measurements such as responsiveness, accuracy,
reliability and technical competence; system use could use measurements such as amount, frequency, nature, extent and purpose of use; user satisfaction could be
measured by single item or via multi-attribute scales; and net benefits could be measured
through increased sales, cost reductions or improved productivity. The intention of the
D&M model was to cover all possible IS success variables. In the context of this paper,
the first question that arises is which factors from those dimensions (IS success variables) can be used as measures of success for BI projects. Can examples of key measures
proposed by DeLone and McLean [33] as standard critical success factors (CSFs), be
used to measure the success of system changes relevant for BI reporting? As BI is a
branch of IS science, the logical answer seems to be, yes. However, to identify
appropriate IS success variables from the D&M model and associated CSFs we have to
focus on activities, phases and processes relevant for BI.
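As an illustration of how the D&M "user satisfaction" variable could be operationalised via a multi-attribute scale to compare the original and amended versions of a BI solution, consider the following sketch. The attribute names, ratings and the unweighted mean are our assumptions for illustration, not part of the D&M model or of the evaluation tool developed later in this paper.

```python
# Illustrative sketch only: attributes, ratings and the unweighted mean
# are assumptions. It shows how a multi-attribute scale for the D&M
# "user satisfaction" variable could compare the original and amended
# versions of a BI solution on a 1-5 Likert scale.
def satisfaction_score(ratings):
    """Unweighted mean of attribute ratings on a 1-5 scale."""
    return sum(ratings.values()) / len(ratings)

# Hypothetical ratings before and after a change to the BI solution.
before = {"ease of use": 3.0, "response time": 2.5, "accuracy": 4.0}
after = {"ease of use": 4.0, "response time": 3.5, "accuracy": 4.0}

improvement = satisfaction_score(after) - satisfaction_score(before)
print(f"satisfaction improvement: {improvement:+.2f}")  # → +0.67
```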
3 Measurements Relevant to Improve and Manage
Existing BI Processes
Measuring business performance has a long tradition in companies, and it can be useful
in the case of BI to perform activities such as determining the actual value of BI to a
company or to improve and manage existing BI processes [10]. Lönnqvist and Pirttimäki propose four phases to be considered when measuring the performance of BI:
(1) identification of information needs, (2) information acquisition, (3) information
analysis and (4) storage and information utilisation [10]. The first phase considers
activities related to discovering business information needed to resolve problems, the
second acquisition of data from heterogeneous sources, and the third analysis of
acquired data and wrapping them into information products [10]. The focus of this
paper is on measuring the impact of BI system changes to BI reporting processes,
meaning that the first three phases are outside the scope of the paper. Before decision
makers can properly utilise information by applying reporting processes, it has to be
communicated to them adequately and in a timely manner, making the fourth phase,
namely storage and information utilisation, relevant for this paper.
Storage and information utilisation covers how to store, retrieve and share
knowledge and information optimally with business and other users, by
using different BI applications, such as queries, reports and dashboards. Thus, it covers
two clusters of measurements we identified as relevant: (i) business/end-user satisfaction, and (ii) technical functionality.
Business/End-User Satisfaction
User satisfaction is recognised as a critical measure of the success of IS [31, 33–42].
User satisfaction has been seen as a surrogate measure of IS effectiveness [43] and is
one of the most extensively used aspects for the evaluation of IS success [28]. Data
Warehouse (DW) performance must be acceptable to the end user community [42].
Consequently, performance of BI reporting solutions, such as reports and dashboards,
needs to meet this criterion.
N. Dedić and C. Stanier
Doll and Torkzadeh defined user satisfaction as “an affective attitude towards a
specific computer application by someone who interacts with the application directly”
[38]. For example, by positively influencing the end user experience, such as
improving productivity or facilitating easier decision making, IS can cause a positive
increment of user satisfaction. On the other hand, by negatively influencing the end user
experience, IS can lead to lower user satisfaction. User satisfaction can be seen as the
sum of feelings or attitudes of a user toward a number of factors relevant to a specific
situation [36].
We identified user satisfaction as one cluster of measurements that should be
considered in relation to the success of BI reporting systems; however, it is important to
define what is meant by user in this context. Davis and Olson distinguished between two
user groups: users making decisions based on output of the system, and users entering
information and preparing system reports [44]. According to Doll and Torkzadeh [38]
end-user satisfaction in computing can be evaluated in terms of both the primary and
secondary user roles; thus, they merge the two groups defined by Davis and Olson into a single end-user group.
We analysed relevant user roles in eight large companies, which utilise BI, and
identified two different user roles that actually use reports to make business
decisions or to carry out their operational or everyday activities: Management and
Business Users. These roles are very similar to the groups defined by Davis and Olson.
Management uses reports and dashboards to make decisions at enterprise level.
Business users use reports and dashboards to make decisions at lower levels, such as
departments or cost centres, and to carry out operational and everyday activities, such as
controlling or planning. Business users are expected to control the content of the
reports and dashboards and to request changes or corrections if needed. They also
communicate Management requirements to technical personnel, and should participate
in BI Competency Centre (BICC) activities. Business users can also have a more
technical role. In this paper, we are interested in measuring user satisfaction in relation
to Business users.
Measuring User Satisfaction. Doll and Torkzadeh developed a widely used model to
measure End User Computer Satisfaction (EUCS) that covers all key factors of the user
perspective [38, 40]. The model to measure end user computer satisfaction included
twelve attributes in the form of questions covering five aspects: content, accuracy,
format, ease of use and timeliness. This model is well validated and has been found to
be generalizable across several IS applications; however, it has not been validated with
users of BI [40].
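The structure of the EUCS instrument (twelve attribute questions grouped under five aspects) can be represented as a simple mapping. The sketch below follows the standard EUCS layout; the exact assignment of questions to aspects is an assumption based on the instrument as reported in [38].

```python
# The EUCS instrument [38]: five aspects, twelve attribute questions.
# The grouping of questions under aspects is an assumption here.
eucs = {
    "content": [
        "Does the system provide the precise information you need?",
        "Does the information content meet your needs?",
        "Does the system provide reports that seem to be just about exactly what you need?",
        "Does the system provide sufficient information?",
    ],
    "accuracy": [
        "Is the system accurate?",
        "Are you satisfied with the accuracy of the system?",
    ],
    "format": [
        "Do you think the output is presented in a useful format?",
        "Is the information clear?",
    ],
    "ease_of_use": [
        "Is the system user friendly?",
        "Is the system easy to use?",
    ],
    "timeliness": [
        "Do you get the information you need in time?",
        "Does the system provide up-to-date information?",
    ],
}

# Sanity check: five aspects, twelve attributes in total.
assert len(eucs) == 5
assert sum(len(questions) for questions in eucs.values()) == 12
```

A representation of this kind makes it straightforward to cross-tabulate aspects against other dimensions, as done later for the phases of measuring BI performance.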
Petter et al. [34] provide several examples of measuring user satisfaction aspects as a
part of IS success based on the D&M IS Success Model. According to them, we
can use single items to measure user satisfaction, semantic differential scales to assess
attitudes and satisfaction with the system, or multi-attribute scales to measure user
information satisfaction. However, we face three issues when considering this approach
in the context of evaluating user satisfaction concerning changes to BI reporting systems. The first is that the discussion concerns methods of measurement rather than the
measurements themselves. The second is that this approach is designed for IS rather
than the narrower spectrum of BI. The third is that this approach does not identify
explicit measurements to be used to validate success when changes are made to BI
reporting systems. Considering the D&M model in the context of this paper, we
identify ease of use and flexibility as the measures of system quality possibly relevant
when measuring user satisfaction.
In the Data Warehouse Balanced Scorecard Model (DWBSM), user perspective
based on user satisfaction with data quality and query performance is defined as one of
four aspects when measuring the success of the DW [42]. DWBSM considers data
quality, average query response time, data freshness and timeliness of information per
service level agreement as key factors in determining user satisfaction. As DWs are at
the heart of BI systems [1, 47], those factors are relevant to evaluating the success of
changes to BI reporting but are not comprehensive enough as they cover only one part
of a BI system.
To develop a model for the measurement of success in changes to BI reporting
systems, we combined elements from different approaches, cross tabulating the aspects
and attributes of the EUCS model with the phases to be considered when measuring
performance of BI discussed in Sect. 3. Table 1 shows the initial results of the cross
tabulation, with areas of intersection marked with an 'x', where each number represents a phase to be considered when measuring the performance of BI as proposed by
Lönnqvist and Pirttimäki. The questions shown in Table 1 were later modified following feedback, as discussed in Sect. 4.
As discussed in Sect. 3, only the storage and information utilisation phase (marked
with number 4 in Table 1) from the Lönnqvist and Pirttimäki approach is relevant
when measuring the success of changes to BI reporting systems to enable more
optimal reporting.

Table 1. Cross-tabulation of EUCS attributes and phases of measuring BI performance

EUCS aspects and their attributes [38], grouped under content, accuracy, format, ease of use and timeliness, cross-tabulated against the phases of measuring BI performance [10]:
- Does the system provide the precise information you need?
- Does the information content meet your needs?
- Does the system provide reports that seem to be just about exactly what you need?
- Does the system provide sufficient information?
- Is the system accurate?
- Are you satisfied with the accuracy of the system?
- Do you think the output is presented in a useful format?
- Is the information clear?
- Is the system user friendly?
- Is the system easy to use?
- Do you get the information you need in time?
- Does the system provide up-to-date information?

Based on the analysis given in Table 1, it is possible to extract a list
of attributes (questions) to be used as user satisfaction measurements. We extracted
eight key measures and modified these for use in the BI context. The elements identified from the EUCS model were extended to include three additional questions related
to changing descriptive content (CDS) of BI reports. The descriptive content of reports
can include, but is not limited to, descriptions of categories, hierarchies or attributes,
such as product, customer or location names. The most common causes of
such requests for changes to descriptive content are errors in the descriptions, and CDS
issues are common with large and rapidly changing dimensions [47].
Table 2 presents the questions developed from these measures, which were later
revised following feedback during the initial phase of validation.
The design of the questions supports both an interview-based approach and a
quantitative survey-based approach. However, using only user satisfaction criteria is
not sufficient to measure the success of modifications to reporting systems.
Table 2. User satisfaction questions to measure success of improving existing BI system
Does the information content of the reports meet your needs?
Are the BI system and reports accurate?
Are you satisfied with the accuracy of the BI system and the associated reports?
Do you think the output is presented in a useful format?
Are the BI system and associated reports user friendly?
Are the BI system and associated reports easy to use?
Do you get the information you need in time?
Do the BI system and associated reports provide up-to-date information?
Are you satisfied with the changing descriptive content (CDS) functionality?
Is the BI system flexible enough regarding CDS functionality?
Is CDS functionality fast enough to fulfil business requirements in a timely fashion?
Technical Functionality
In Sect. 2, we identified technical functionality as the second cluster of measurements
that need to be considered when measuring the success of changes to BI reporting
systems. To initiate and manage improvement activities for specific software solutions,
it has been suggested that there should be sequential measurements of the quality
attributes of the product or process [48].
Measuring Technical Functionality. In the DWBSM approach, the following technical key factors are identified: ETL code performance, batch cycle runtimes, reporting
and BI query runtime, agile development, testing and flawless deployment into the production environment [42]. We identify reporting and BI query runtime as relevant in the
context of BI reporting. From the D&M IS success model, we extract the response time
measure from the system quality cluster of IS success variables. Reporting and BI query
runtime and response time both belong to the time category, although they are named differently. However, to measure the technical success of modifications to BI
reporting solutions, it is not sufficient simply to measure time. We need a clear
definition and extraction of each relevant BI technical element
belonging to the time and other technical categories that should be evaluated. Table 3
shows the extracted time elements and includes elements related to memory use and
technical scalability.
Table 3. Technical measurements of success to improve existing BI system
Initial BI report or dashboard execution time
Query execution time
Re-execution time when changing report language, currency or unit
Time required to change erroneous descriptions of descriptive attributes/hierarchies
Database memory consumption
CPU memory usage during execution of: (a) Initial BI report or dashboard; (b) Query;
(c) Re-execution of report when changing language, currency or unit;
Technical scalability and support for integration of proposed solution in regard to
existing environment
Flexibility and extensibility in regard to possible extension of the system in the future
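The time and memory elements listed in Table 3 could be captured with a small instrumentation sketch like the following; `run_report` is a hypothetical stand-in for executing a BI report or SQL query, not part of the authors' tool.

```python
import time
import tracemalloc

def run_report():
    # Hypothetical stand-in for executing a BI report or SQL query.
    return sum(i * i for i in range(100_000))

# Record execution time and peak (Python-side) memory for one run.
tracemalloc.start()
start = time.perf_counter()
run_report()
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"execution time: {elapsed:.4f} s, peak memory: {peak / 1024:.1f} KiB")
```

In a real BI environment these values would instead come from database monitoring views or server-side metrics, since report execution happens outside the measuring process.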
4 Producing an Evaluation Tool to Measure Success
of Changing BI Environment
As discussed in Sect. 3, we elicited two clusters of measurements for use when
evaluating the success of changes to BI reporting systems. The measurements identified
in the user satisfaction and in technical functionality clusters are intended to be
recorded at two stages: (i) in the existing BI environment - before implementing any
changes, and (ii) after modification of existing BI system - in a new environment. By
comparing their values, the result from both stages can then be used to evaluate the
success of changes to the BI reporting system.
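The two-stage comparison described above can be sketched as follows; the measurement names and values are purely illustrative, not results from the study.

```python
# Illustrative before/after values recorded for a few technical measurements
# (seconds and megabytes); for all of these, lower is better.
before = {"report_execution_s": 12.4, "query_execution_s": 3.1, "db_memory_mb": 900.0}
after = {"report_execution_s": 8.2, "query_execution_s": 2.4, "db_memory_mb": 950.0}

def relative_change(before, after):
    """Percentage change per measurement; negative values indicate improvement here."""
    return {k: round(100 * (after[k] - before[k]) / before[k], 1) for k in before}

print(relative_change(before, after))
```

The same structure applies to the user satisfaction cluster, with Likert scores from the two stages compared instead of runtimes.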
To produce a tool for use by relevant stakeholders, we merged both clusters of
measurements into one and developed a questionnaire-like evaluation tool. We conducted a pilot survey with 10 BI domain experts and report users. Based on the
responses received, the questions shown in Table 2 were amended; questions 2 and 3
were merged, we amended questions 5 and 6 and we removed question 9 as surplus.
We also added one additional question identified as highly important by business users
relating to the exporting and sharing of content functionality. We added one additional
technical question, relating to speed of execution time when drilling-down, conditioning, removing or adding columns in reports. The final list of factors is shown in
Table 4.
We validated the proposed factors by carrying out a survey with 30 key users
working in the BI field. All users were asked to complete the user satisfaction element
of the survey. However, technical functionality factors are arguably comprehensible
and relevant only for technical users; thus, answering this part of the survey was
optional and dependent on the respondent's expertise.

Table 4. Survey results based on Likert-type items (Nr., Mode and Median reported
separately for Business users (Nr. = 16), Technical users (Nr. = 14) and All users
(Nr. = 30))

Cluster of measurements: User satisfaction
- Information content meets your needs?
- The information provided in the reports is accurate?
- Output is presented in a format that you find useful?
- The system and associated reports are easy for you to use?
- Information in the reports is up to date?
- Reports have the functionality that you require?
- The BI system is flexible enough to support easy change of "descriptive content"?
- Is the change of "descriptive content" fast enough to fulfil business requirements?
- Exporting and sharing content functionalities meet your needs?

Cluster of measurements: Technical functionality
- Speed of execution time for initial BI report or dashboard
- Speed of execution time for SQL query
- Speed of re-execution time when changing report language, currency or unit
- Speed of execution time when drilling-down, conditioning, removing or adding columns in reports
- Amount of time required to change erroneous descriptions of descriptive attributes and hierarchies
- Database memory consumption
- CPU memory usage during execution of initial BI report or dashboard
- CPU memory usage during execution of SQL query
- CPU memory usage during re-execution of report when changing language, currency or unit
- Technical scalability of proposed solution in the existing environment
- Support for possible extension of the system in the future
As we had a series of questions and statements which needed to be validated, a Likert
scale [45] was used, scoring each factor on a scale of 1–5 (where 1 is least important
and 5 is most important). In the original Likert scale approach, responses are combined
to create an attitudinal measurement scale, thus performing data analysis on the
composite score from those responses [46]. However, our intention was to score each
individual question or statement separately and to examine the views of users regarding
each separate factor. We therefore used the concept of Likert-type items that supports
using multiple questions as a part of the research instrument, but without combining the
responses into composite values [46, 49]. Likert-type items fall into the ordinal measurement scale; thus, the mode or median is recommended to measure central tendency
[46]. The results of our survey are presented in Table 4, and are grouped into two
clusters of measurements, namely user satisfaction and technical functionality, where
each contains individual factors.
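Treating each Likert-type item separately, as described above, amounts to computing the mode and median per question rather than a composite score. The item names and responses below are invented for illustration only.

```python
from statistics import median, multimode

# Invented responses on the 1-5 importance scale, one list per survey item.
responses = {
    "Information content meets your needs?": [5, 4, 5, 5, 3, 4],
    "Database memory consumption": [3, 4, 3, 5, 3, 4],
}

# Likert-type items are ordinal, so mode and median (not the mean)
# are the appropriate measures of central tendency.
for item, scores in responses.items():
    print(f"{item}: mode={multimode(scores)}, median={median(scores)}")
```

Each item is reported on its own, mirroring the per-factor presentation used in Table 4.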
As can be seen from Table 4, no single question relevant to user satisfaction had a mode
or median of less than 4, indicating that each question was considered important. No
single technical factor had a mode or median of less than 3, showing a strong tendency
towards considering each technical factor important. As expected, a larger percentage
of users with a greater technical role commented on technical aspects than users with a
greater business orientation. Users with a greater business orientation rated user satisfaction questions as more important than users with a greater technical role, and the
same effect was found in relation to users with a greater technical role commenting on
technical functionality factors.
A free text question allowed survey respondents to suggest additional factors and
this identified two additional questions that could be relevant to the measurement of
user satisfaction:
– Description of the key figures is available, sufficient and easily accessible via BI reports?
– Functionality allowing further consolidation of existing information is available in
BI reports?
It also elicited one additional factor that could be used to measure technical functionality:
– How platform independent are BI reports (able to run on any PC, OS, laptop or
mobile device)?
However, those three additional factors were not validated in the same way as the
factors listed in Table 4, thus, we do not include them and propose Table 4 as the core
evaluation tool. An advantage of the approach is that the tool can be customised and
additional factors added by stakeholders, meaning that the additional features identified
in the survey could be added by users if required.
The proposed tool is limited to the reporting aspect of BI and to the business user
group. A possible extension would be to consider the views of other user
groups, such as conceptual or organizational users. The tool focuses on changes to support BI
reporting and is not suitable for measuring the success of changes in regard to data
warehousing, data acquisition or data modelling aspects. The tool would be easier to
use if provided as a web-based tool.
The tool discussed in this paper provides a mechanism for measuring the success of
changes made to reporting in BI systems. The use of the tool could be extended beyond
evaluation of changes to BI reporting systems and could be used as a general benchmarking tool when evaluating different BI software from the reporting aspect. For
example, business and especially key BI users could use the proposed tool to benchmark
and select the most suitable existing BI software for implementation in their organisation. The approach used here could also be extended for use with other elements, such
as the impact of changes in data warehousing, data acquisition or data modelling.
5 Conclusions and Future Work
The focus of this paper was on measuring the success of new approaches to changing
and improving existing BI solutions to enable more optimal BI reporting. Consequently, we explained BI and defined what we understand by success in terms of
changes to BI reporting, we elicited appropriate clusters, including criteria to be used
for measuring such success and developed an evaluation tool to be used by relevant
stakeholders to measure success. Finally, using a preliminary and a further survey, we
validated our findings with relevant domain experts and key users. Future work will
consist of using the evaluation tool in a real world environment to measure success
when amending BI systems to improve BI reporting. This will allow evaluation of the
tool on a case study basis.
References

1. Olszak, C.M., Ziemba, E.: Business intelligence systems in the holistic infrastructure
development supporting decision-making in organisations. Interdiscip. J. Inf. Knowl.
Manag. 1, 47–58 (2006)
2. Marchand, M., Raymond, L.: Researching performance measurement systems: an information systems perspective. Int. J. Oper. Prod. Manag. 28(7), 663–686 (2008)
3. Brannon, N.: Business intelligence and e-discovery. Intellect. Prop. Technol. Law J. 22(7),
1–5 (2010)
4. Alexander, A.: Case studies: business intelligence. Account. Today 28(6), 32 (2014)
5. Thamir, A., Poulis, E.: Business intelligence capabilities and implementation strategies. Int.
J. Glob. Bus. 8(1), 34–45 (2015)
6. Popovič, A., Turk, T., Jaklič, J.: Conceptual model of business value of business intelligence
systems. Manag.: J. Contemp. Manag. 15(1), 5–29 (2010)
7. Kurniawan, Y., Gunawan, A., Kurnia, S.G.: Application of business intelligence to support
marketing strategies: a case study approach. J. Theor. Appl. Inf. Technol. 64(1), 214 (2014)
8. Luhn, H.P.: A business intelligence system. IBM J. Res. Dev. 2(4), 314–319 (1958)
9. Power, D.J.: Decision Support Systems: Concepts and Resources for Managers. Greenwood
Publishing Group, Westport (2002)
10. Lönnqvist, A., Pirttimäki, V.: The measurement of business intelligence. Inf. Syst. Manag.
23(1), 32–40 (2006)
11. Moss, L.T., Atre, S.: Business Intelligence Roadmap: The Complete Project Lifecycle for
Decision-support Applications. Addison-Wesley Professional, Boston (2003)
12. Golfarelli, M., Rizzi, S., Cella, I.: Beyond data warehousing: what’s next in business
intelligence? In: Proceedings of the 7th ACM International Workshop on Data Warehousing
and OLAP, pp. 1–6. ACM Press, New York (2004)
13. Dekkers, J., Versendaal, J., Batenburg, R.: Organising for business intelligence: a framework
for aligning the use and development of information. In: BLED 2007 Proceedings, Bled,
pp. 625–636 (2007)
14. Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., Becker, B.: The Data Warehouse
Lifecycle Toolkit, 2nd edn. Wiley, Indianapolis (2008)
15. Jamaludin, I.A., Mansor, Z.: Review on business intelligence “BI” success determinants in
project implementation. Int. J. Comput. Appl. 33(8), 24–27 (2011)
16. Turban, E., Sharda, R., Delen, D., King, D.: Business Intelligence: A Managerial Approach,
2nd edn. Prentice Hall, Upper Saddle River (2010)
17. Inmon, B.W.: Building the Data Warehouse, 4th edn. Wiley, Indianapolis (2005)
18. Watson, H.J., Wixom, B.H.: The current state of business intelligence. Computer 40(9), 96–
99 (2007)
19. Baars, H., Kemper, H.-G.: Management support with structured and unstructured data—an
integrated business intelligence framework. Inf. Syst. Manag. 25(2), 132–148 (2008)
20. Ranjan, J.: Business intelligence: concepts, components, techniques and benefits. J. Theor.
Appl. Inf. Technol. 9(1), 60–70 (2009)
21. Gluchowski, P., Kemper, H.-G.: Quo vadis business intelligence? BI-Spektrum 1, 12–19
22. Chu, T.-H.: A framework for BI systems implementation in manufacturing. Int. J. Electron.
Bus. Manag. 11(2), 113–120 (2013)
23. Anadiotis, G.: Agile business intelligence: reshaping the landscape, p. 3 (2013)
24. Obeidat, M., North, M., Richardson, R., Rattanak, V., North, S.: Business intelligence
technology, applications, and trends. Int. Manag. Rev. 11(2), 47–56 (2015)
25. Imhoff, C., Galemmo, N., Geiger, J.G.: Mastering Data Warehouse Design: Relational and
Dimensional Techniques. Wiley Publishing, Inc., Indianapolis (2003)
26. Mykitychyn, M.: Assessing the maturity of information architectures for complex dynamic
enterprise systems. Georgia Institute of Technology (2007)
27. Rajesh, R.: Supply Chain Management for Retailing. Tata McGraw-Hill Education, Kolkata
28. Sedera, D., Tan, F.T.C.: User satisfaction: an overarching measure of enterprise system
success. In: PACIS 2005 Proceedings, vol. 2, pp. 963–976 (2005)
29. Olszak, C.M., Ziemba, E.: Critical success factors for implementing business intelligence
systems in small and medium enterprises on the example of Upper Silesia, Poland.
Interdiscip. J. Inf. Knowl. Manag. 7(2012), 129 (2012)
30. Işik, Ö., Jones, M.C., Sidorova, A.: Business intelligence success: the roles of BI capabilities
and decision environments. Inf. Manag. 50(1), 13–23 (2013)
31. DeLone, W.H., McLean, E.R.: Information systems success: the quest for the dependent
variable. Inf. Syst. Res. 3(1), 60–95 (1992)
32. Sabherwal, R., Chowa, C.: Information system success: individual and organisational
determinants. Manag. Sci. 52(12), 1849–1864 (2006)
33. DeLone, W.H., McLean, E.R.: The DeLone and McLean model of information systems
success: a ten-year update. J. Manag. Inf. Syst. 19(4), 9–30 (2003)
34. Petter, S., DeLone, W., McLean, E.: Information systems success: the quest for the
independent variables. J. Manag. Inf. Syst. 29(4), 7–61 (2013)
35. Powers, R.F., Dickson, G.W.: MIS project management: myths, opinions, and reality. Calif.
Manag. Rev. 15(3), 147–156 (1973)
36. Bailey, J.E., Pearson, S.W.: Development of a tool for measuring and analyzing computer
user satisfaction. Manag. Sci. 29(5), 530–545 (1983)
37. Ives, B., Olson, M., Baroudi, J.: The measurement of user information satisfaction.
Commun. ACM 26(10), 785–793 (1983)
38. Doll, W.J., Torkzadeh, G.: The measurement of end-user computing satisfaction. MIS Q. 12
(2), 259–274 (1988)
39. Davison, J., Deeks, D.: Measuring the potential success of information system implementation. Meas. Bus. Excell. 11(4), 75–81 (2007)
40. Chung-Kuang, H.: Examining the effect of user satisfaction on system usage and individual
performance with business intelligence systems: an empirical study of Taiwan’s electronics
industry. Int. J. Inf. Manag. 32(6), 560–573 (2012)
41. Dastgir, M., Mortezaie, A.S.: Factors affecting the end-user computing satisfaction. Bus.
Intell. J. 5(2), 292–298 (2012)
42. Rahman, N.: Measuring performance for data warehouses-a balanced scorecard approach.
Int. J. Comput. Inf. Technol. 4(2), 1–6 (2013)
43. Gatian, A.W.: Is user satisfaction a valid measure of system effectiveness? Inf. Manag. 26(3),
119–131 (1994)
44. Davis, G.B., Olson, M.H.: Management Information Systems: Conceptual Foundations,
Structure, and Development, 2nd edn. McGraw-Hill, Inc., New York City (1985)
45. Likert, R.: A technique for the measurement of attitudes. Arch. Psychol. 22(140), 5–55
46. Boone, H.N.J., Boone, D.: Analyzing Likert data. J. Ext. 50(2), 30 (2012)
47. Dedić, N., Stanier, C.: An evaluation of the challenges of multilingualism in data warehouse
development. In: Proceedings of the 18th International Conference on Enterprise Information
Systems (ICEIS 2016), Rome, Italy, pp. 196–206 (2016)
48. Florak, W.A., Park, R.E., Carleton, A.: Practical Software Measurement: Measuring for
Process Management and Improvement, 1st edn. Software Engineering Institute, Carnegie
Mellon University, Pittsburgh (1997)
49. Clason, D.L., Dormody, T.J.: Analyzing data measured by individual Likert-type items.
J. Agric. Educ. 35(4), 31–35 (1994)
An Architecture for Data Warehousing in Big
Data Environments
Bruno Martinho and Maribel Yasmina Santos(&)
ALGORITMI Research Centre, University of Minho, Guimarães, Portugal
Abstract. Recent advances in Information Technologies facilitate the increasing capacity to collect and store data, with the term Big Data often mentioned.
In this context, many challenges need to be addressed, Data Warehousing being
one of them. In this sense, the main purpose of this work is to propose an
architecture for Data Warehousing in Big Data, taking as input a data source
stored in a traditional Data Warehouse, which is transformed into a Data
Warehouse in Hive. Before proposing and implementing the architecture, a
benchmark was conducted to verify the processing times of Hive and Impala,
understanding how these technologies could be integrated in an architecture
where Hive plays the role of a Data Warehouse and Impala is the driving force
for the analysis and visualization of data. After the proposal of the architecture,
it was implemented using tools like the Hadoop ecosystem, Talend and Tableau,
and validated using a data set with more than 100 million records, obtaining
satisfactory results in terms of processing times.
Keywords: Big data · Data warehouse · NoSQL · Hadoop · Hive · Impala
1 Introduction
Nowadays, due to the high competitiveness between organizations, they
need to invest more and more in technology. Usually, this need arises from
frequent changes in business trends as well as in customers' habits [1]. Data
Warehouse and On-line Analytical Processing (OLAP) technologies have been
following this evolution to the present day [1], a Data Warehouse being a database to
support analytical processing and to assist in the decision-making process [2]. The
implementation of these systems usually occurs in relational databases that may not be
able to store and process large volumes of data [3].
With the recent technological advances, organizations are collecting more and more
data, with different types, formats and speeds. When used and analyzed in the proper
way these data have enormous potential, enabling organizations to completely change
their business systems for better results [4]. Realising the potential of this information, in this increasingly digital world, requires not only new data analysis algorithms, but also a new generation of systems and distributed computing environments
to deal with the sharp increase in the volume of data and its lack of structure [5]. The
challenge is to enhance the value of these data, as these are sometimes in completely
different formats [6]. Combining the large amounts of data with the need to analyze
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 237–250, 2016.
DOI: 10.1007/978-3-319-49944-4_18
them, there is a need to rethink the role of Data Warehousing in the context of Big Data,
Big Data being the ability to collect, store and process large volumes of data [4]. Big
Data refers mainly to the massive amounts of unstructured data produced by
high-performance applications [7], but also data that arrive in structured and
semi-structured formats [8]. Big Data solutions are ideal for analysing data from a
variety of sources [9], whose characteristics (volume, velocity, variety) make Big Data
a major challenge for organizations that still use traditional mechanisms for data analysis.
Given this context, the question is where to store these massive amounts of data for
analytical purposes and which data models must be used. Regarding storage, the
movement called NoSQL (Not Only SQL) promotes many innovative solutions for the
storage and processing of large volumes of data [10]. These databases usually do not
provide guidelines on how to model and implement a Data Warehouse for Big
Data contexts. Regarding specific technologies, the Hadoop Distributed File System (HDFS) and
Hive [11] for storage and Impala [12] for processing are frequently mentioned.
So far, the development of Data Warehousing in Big Data has been guided by use-case
driven approaches in which specific technologies and implementation contexts are
proposed and tested to solve specific problems [13]. Although these relevant
works provide useful guidelines on how to proceed, they do not envisage a generic
architecture that complements storage and processing technologies for the
efficient implementation of a Big Data Warehouse.
To achieve this aim, this work benchmarks Hive and Impala for data processing, while Hive is used as the Big Data Warehouse repository [12, 14]. The knowledge obtained from this benchmark is crucial for proposing an architecture for the implementation of Big Data Warehouses, which was tested in a demonstration case that stored and processed more than 100 million records.
This paper is organized as follows. Section 2 presents the related work. Section 3 summarizes the main findings of the benchmark performed to compare the performance of Hive and Impala for data processing. Section 4 describes the proposed architecture, while Sect. 5 presents some results from its implementation. Section 6 concludes with some remarks and guidelines for future work.
2 Related Work
As Big Data is a recent research topic, there is no common approach on how to design and implement a Big Data Warehouse. Many authors discuss this need and propose works that are mainly guided by use-case-driven approaches, where specific solutions are recommended and tested, mostly giving non-structured guidelines on how to design Big Data Warehouses and revisiting traditional modeling techniques. However, the traditional logical data models used in the implementation of Data Warehouses do not fit an environment characterized by large amounts of data, as in Big Data; therefore, repositories like NoSQL databases and Hadoop are the most recent trends in data storage [2], providing infrastructures for implementing Data Warehouses and multidimensional data structures in the
form of OLAP data cubes [15].
An Architecture for Data Warehousing in Big Data Environments
In [4], the need to redesign traditional Data Warehouses in order to address new challenges such as data types, data volume, user requirements, and performance is highlighted. Moreover, the author mentions that Big Data
Warehouses need to include data from several sources and must be implemented making
use of multiple technologies like relational database management systems, Hadoop,
NoSQL databases, reporting and visualization, among others.
From the technological point of view, many technologies have been proposed, mainly concerning storage, with NoSQL databases being the most noticeable example: more than 225 NoSQL databases have already been proposed, as reported at http://nosql-database.org. From the data modeling point of view, very specific approaches have
been followed, mainly driven by very specific data requirements scenarios. In NoSQL databases, as logical data models are schema-free, meaning that different rows in a table may have different columns (less rigid structures) or that the defined schema may change at runtime, the definition of data schemas follows a different approach [16]. Instead of reflecting the relevant entities in a particular domain and the relationships between those entities, data schemas are defined considering the queries that need to be answered, with data replicated as many times as needed [17], given the importance of query performance when huge volumes of data are being processed [18].
The transformation of traditional data models into data models for NoSQL databases mainly targets two types of repositories: column-oriented and document-oriented. The authors in [2, 3] propose an approach for mapping a conceptual model of a traditional data environment into a logical data model in HBase and MongoDB for data storage in a distributed environment [19]. In these works, the authors use column- and document-oriented databases as data storage areas without the integration of Hive. In this sense, and as Hive is considered the Data Warehouse in the context of Big Data because of its analytical operators, the databases used by those authors do not make available different analytical perspectives on the data.
In another work, [18], the authors recognize that the design of big data warehouses is
very different from traditional data warehouses, as their schema should be based on novel
logical models allowing more flexibility than the relational model does. The authors
propose a design methodology for the representation of a multidimensional schema at the
logical level based on the key-value model. In this approach, a data-driven design, using data repositories as the main source of information, is integrated with a requirements-driven design, using information from the decision makers.
Given this overall context, this work proposes an architecture that uses the Hadoop Distributed File System (HDFS) as the staging area and Hive as the Data Warehouse. For defining the Data Warehouse logical model, a set of transformation rules is used [13], deriving a tabular data model for Hive from a multidimensional data model that holds the data requirements for a specific data analytics scenario. These rules [13] provide as output a set of tables with different analytical perspectives on the data, imitating the online analytical processing (OLAP) cubes normally used in traditional Business Intelligence environments.
In order to improve the performance of the proposed architecture, both in the
ETL/ELT and in the analysis and visualization of data, a benchmark was performed to
verify how Hive and Impala perform. Impala is tested to verify how it performs when
analyzing data that is stored in Hive. According to [20], Impala is faster in querying the data when compared to Hive, as it uses a query engine that does not rely on MapReduce [20, 21]; since Hive translates queries into MapReduce jobs, its performance is slower than that of Impala [21].
However, in some of these scenarios where Impala and Hive are compared, the performance of Impala was not analyzed with the data stored in Hive, as this work proposes, where Impala acts only as a querying mechanism and not as a data storage repository with tables that enhance data analytics over different perspectives. In this work, the performed benchmark does not use a simple table with columns in Hive, but also organizes the data using partitions and buckets, stored in the Parquet format and compressed with Snappy. In this sense, besides verifying the performance of Hive and Impala, another objective was to verify whether Impala is able to use the same data types as Hive, whether it interprets the partitions and buckets correctly, and whether it uses the compression formats. In summary, the objective of the benchmark was not only to verify performance, but also to assess the combination of these two technologies, as they can be used as complementary rather than competing technologies for querying data.
3 Benchmarking Hive and Impala
The study of technologies that can be integrated with Hive, allowing better querying performance, is important in this work, and Impala emerged as a candidate in this direction.
According to [21], Impala emerged as an addition for querying Hive tables, which can be faster than the Hive engine itself. The authors also mention that Hive is more suitable for storing and processing large amounts of data in batch, and Impala for processing in real time. In this work, Hive remains the data storage mechanism, in the form of a Data Warehouse, and its querying component is used for the creation, aggregation, and transformation of data within the Hive tables themselves, being more convenient for ELT processes than Impala [20]. Once the tables are created and the data is stored, Impala is used as a query engine to analyze the information in several dashboards.
The integration between Hive and Impala can be achieved through the use of the metastore, where all metadata associated with the Hive tables is stored [22]. Given this context, Fig. 1 shows how these two technologies can be integrated, having as
Fig. 1. Integration between Hive and Impala (Source: adapted from [22])
implementation environment the Hadoop ecosystem. Considering the benchmarks already mentioned, an appropriate solution suggests the integration of these two technologies, taking advantage of their characteristics, compatibility, and performance.
In order to achieve better performance when processing Hive tables, these must be stored in the Parquet format [20]. This is a columnar format that best suits querying both in Hive and in Impala, taking into consideration CPU and memory consumption [20]. Moreover, the Snappy compression method can be used to reduce the size of the data by half or more, relieving the I/O pressure [20].
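As an illustration of this storage configuration, the following sketch assembles the HiveQL DDL for a partitioned, bucketed Parquet table with Snappy compression. The table and column names are hypothetical, as the paper does not give the exact schema; the clauses themselves are standard HiveQL.

```python
def flights_ddl(table="flights", buckets=32):
    """Build the CREATE TABLE statement for a partitioned, bucketed
    Parquet table with Snappy compression (hypothetical schema)."""
    return (
        f"CREATE TABLE {table} (\n"
        "  flight_id BIGINT,\n"
        "  carrier STRING,\n"
        "  dep_delay_minutes INT\n"
        ")\n"
        "PARTITIONED BY (flight_year INT)\n"
        f"CLUSTERED BY (carrier) INTO {buckets} BUCKETS\n"
        "STORED AS PARQUET\n"
        "TBLPROPERTIES ('parquet.compression'='SNAPPY');"
    )

print(flights_ddl())
```

A statement of this form creates a table that both Hive and Impala can read through the shared metastore, provided both engines support the chosen data types.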
Table 1. Defined queries for the benchmark
To understand the contexts in which Impala query processing outperforms Hive query processing, for data stored in Hive, a set of queries was defined and executed. The dataset used in this benchmark includes more than 100 million records associated with flights in the USA [23]. The technological infrastructure used is a virtual machine with an Intel Core i5-2430 CPU at 2.40 GHz, the CentOS 6.4 operating system, 6 GB of RAM, and 100 GB of SSD storage. This machine runs Hadoop as a single-node cluster.
Table 1 shows the five queries that were defined to test the performance of Impala and Hive. These are single queries over a table, including some aggregation functions and selection or grouping conditions. The dataset used in this benchmark is further described in Sect. 4.2.
The obtained results, in terms of processing times, for the five queries presented in
Table 1, are shown in Fig. 2. As can be seen, Impala had better results when compared
with Hive in querying the data. The difference was more than 30 s for each query, which represents a significant improvement. To determine the processing time, each query was run three times, and the average of the three runs is the time shown in Fig. 2.
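The timing procedure can be sketched as follows; the run times below are made-up placeholders, not the measurements behind Fig. 2.

```python
# Each query was run three times; the reported time is the mean of the runs.
# The figures below are hypothetical, not the paper's measurements.

def mean_runtime(runs):
    """Average wall-clock time (seconds) over repeated runs of one query."""
    return sum(runs) / len(runs)

hive_runs = [48.0, 51.0, 49.5]    # hypothetical Hive times for one query
impala_runs = [12.0, 11.5, 12.5]  # hypothetical Impala times, same query

gap = mean_runtime(hive_runs) - mean_runtime(impala_runs)
print(f"Impala faster by {gap:.1f} s on average")
```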
Fig. 2. Benchmark of Hive and Impala (time in seconds)
4 An Architecture for Big Data Warehousing
4.1 Overall Overview
The proposed architecture makes use of multiple technologies such as HDFS for
storing facts and dimension tables in different files (staging area); Hive to act as a Data
Warehouse, containing the final data set for the data analytics and visualization tasks;
Impala for querying the Hive tables (given the results of the performed benchmark); Talend Open Studio for Big Data, which is responsible for all data flows and ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes; and, finally, Tableau
(www.tableau.com) as the tool for the implementation of analytical dashboards.
As can be seen in Fig. 3, this architecture considers that an organization can have a
traditional Data Warehousing environment that needs to be migrated to a Big Data
environment, or that this Data Warehouse does not exist. In the latter case, the operational data sources can be used to feed the Big Data Staging Area, which will support the loading of the Big Data Warehouse.
Fig. 3. An architecture for Data Warehousing in Big Data
In this work, we consider that a traditional Data Warehouse exists, showing how organizations can move to a Big Data context using the organizational knowledge and the corresponding logical data models that guided the creation of an analytical environment. Although this is not mandatory, this approach helps in setting the logical data model for the Big Data Warehouse, as all the data requirements are available in the multidimensional data model of the traditional Data Warehouse.
The analysis of Fig. 3 shows the several components already mentioned and the data flows among them. The 1st data flow consists of the ETL of the operational data sources (which can be in different formats) into the traditional Data Warehouse, considering the defined logical data model (with the dimension and fact tables). The 2nd data flow includes the ELT of the dimension and fact tables, stored in the traditional Data Warehouse, into HDFS (where each table is stored in a different file).
In case the traditional Data Warehouse does not exist, the operational data sources
can follow the same path, being stored in HDFS in different files. HDFS is used as a
staging area in the implementation of the Big Data Warehouse in Hive. The 3rd data
flow is needed for feeding the Hive tables in an ELT process that stores each file
present in HDFS in a Hive table. In the scenario recommended in this work, each file
corresponds to a dimension or fact table. Once in Hive, these tables are used to perform a set of transformations that leads to new tables, optimized for query processing, as they integrate dimension and fact tables in a way that imitates an analytical cube for analysis and visualization tasks in decision-making contexts. These transformations are further explained in the following subsection.
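A minimal sketch of the 2nd and 3rd data flows, assuming hypothetical table names and staging paths (the paper does not specify them): each Data Warehouse table is exported to one HDFS file, and a standard HiveQL LOAD DATA statement feeds the corresponding Hive table.

```python
STAGING_DIR = "/staging/flights_dw"  # assumed HDFS staging directory

tables = ["dim_calendar", "dim_airport", "fact_delays"]  # illustrative names

def staging_path(table: str) -> str:
    """HDFS file holding the exported content of one DW table (2nd flow)."""
    return f"{STAGING_DIR}/{table}.csv"

def load_statement(table: str) -> str:
    """HiveQL that feeds a Hive table from its staging file (3rd flow)."""
    return f"LOAD DATA INPATH '{staging_path(table)}' INTO TABLE {table};"

for t in tables:
    print(load_statement(t))
```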
The 4th data flow corresponds to the query engine, which uses Impala for querying the Hive tables, feeding the analytical dashboards used in data analysis and visualization tasks.
In technological terms, Tableau is connected to Impala, which interacts with the Hive
metastore, querying the data available in the Big Data Warehouse.
4.2 Logical Data Model for a Big Data Warehouse
In Big Data contexts, logical data models are usually defined according to the queries that need to be answered. In this work, we use the proposal of [13] for setting the logical data model of a Big Data Warehouse, as the authors propose a set of rules that automatically transform a multidimensional data model into a tabular model suited to be implemented in Hive. This approach has the advantage of using the data and analytical requirements identified in the multidimensional data model, guiding the implementation process of a Big Data Warehouse.
The approach proposed by [13] allows the identification of a complete set of tables that imitate the way analytical data cubes work in traditional Business Intelligence contexts. The approach combines different dimension and fact tables, integrating them into Hive tables that provide the different analytical perspectives.
As it is not possible (nor the objective) in this paper to explain all these rules in detail and show all the transformations, this work uses two of the obtained Hive tables for demonstration purposes. The transformation process started from a multidimensional model that includes seven dimension tables (Calendar, Time, Airport, Flight, Carrier, Airplane, Cancellation) and two fact tables (Flights, Delays) for storing and analyzing data about commercial flights in the USA. Following the rules in [13], a set of 127 different Hive tables can be identified, able to answer all possible questions over this dataset. These tables include different data granularities, meaning different levels of detail, making it possible to choose the appropriate ones for specific analytical contexts, or to choose some of the more detailed ones and use them to obtain more aggregated data. This is possible when analytical tools like Tableau are used to analyze the more detailed Hive tables, as the defined aggregation functions allow data summarization.
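The count of 127 tables equals the number of non-empty subsets of the seven dimensions (2^7 − 1), which suggests one candidate Hive table per combination of dimensions joined to a fact table; the enumeration below reflects this reading of the rules in [13].

```python
from itertools import combinations

# The seven dimension tables of the multidimensional model described above.
dimensions = ["Calendar", "Time", "Airport", "Flight",
              "Carrier", "Airplane", "Cancellation"]

# One candidate Hive table per non-empty subset of dimensions.
subsets = [c for k in range(1, len(dimensions) + 1)
           for c in combinations(dimensions, k)]

print(len(subsets))  # 127, matching the number of tables reported above
```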
Without giving too much detail, Fig. 4 briefly shows how the combination of
dimension and fact tables can be achieved in the transformation process to derive the
Hive tables. As can be understood, different combinations lead to different Hive tables, both in terms of the attributes available for data analysis and in the level of detail of each table.
From this set of tables, Figs. 5 and 6 show the composition, in terms of columns, of the two tables that were selected. The columns can be descriptive (those inherited from dimension tables) or analytical (those inherited from fact tables). Due to the extensive number of business indicators available in the
Fig. 4. Transformation process for deriving the Hive tables
multidimensional data model used as the source of the data and analytical requirements, both figures present only a subset of the available metrics.
The aggDelays table characterizes flight delays, considering the departure and arrival airports, the carrier, the calendar dimension, and the time of day at which the flight took place. As business indicators, the total delay in minutes, the delay by reason (security, company, weather, …), and the number of delayed flights, among other metrics, are considered. The time component is divided into one-hour intervals ([00:00–01:00[, [01:00–02:00[, and so on).
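The hourly bucketing can be sketched with a small helper; the function is our illustration, while the label format mirrors the paper's right-open interval notation.

```python
def hour_interval(hhmm: str) -> str:
    """Map an 'HH:MM' departure time to its right-open one-hour
    interval label (illustrative helper, not from the paper)."""
    hour = int(hhmm.split(":")[0])
    return f"[{hour:02d}:00\u2013{(hour + 1) % 24:02d}:00["

print(hour_interval("19:37"))  # [19:00–20:00[
```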
As can be seen in Fig. 5, the analytical columns derived from the fact table include, for each attribute, an aggregation function that allows the summarization of the dataset (this depends on the level of detail of the descriptive columns with regard to the analytical columns).
Fig. 5. aggDelays Hive table of the Big Data Warehouse
The aggFlights table (Fig. 6) includes the information about all the flights, here aggregated according to the airports (origin and destination), the carrier, the airplane, and the cancellation type (in case of flight cancellation). As business indicators, the
Fig. 6. aggFlights Hive table of the Big Data Warehouse
duration of the flights, the traveled distance, and the number of flights, among other attributes, can be analyzed.
The number of records in each table depends on the level of aggregation considered in the transformation process, which varies according to the dimensions combined for a specific fact table; more or less detailed tables can be obtained. In the examples considered here, the first table (aggDelays) has 123 534 969 records, containing all the available data, as no summarization was possible, while the second table (aggFlights) has 3 774 583 records, accomplishing a significant summarization of the available data.
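The effect of the aggregation level on the record count can be illustrated with a toy group-by; the rows are made up, but the mechanism is the same one that collapses the detailed data into the far smaller aggFlights table.

```python
from collections import defaultdict

# Hypothetical detailed rows: (origin, destination, carrier, distance_km).
flights = [
    ("JFK", "LAX", "AA", 3980),
    ("JFK", "LAX", "AA", 3980),
    ("JFK", "SFO", "UA", 4150),
]

# Group by the descriptive columns and apply COUNT/SUM-style aggregations.
agg = defaultdict(lambda: {"num_flights": 0, "total_distance": 0})
for origin, dest, carrier, dist in flights:
    row = agg[(origin, dest, carrier)]
    row["num_flights"] += 1
    row["total_distance"] += dist

print(len(flights), "detailed rows ->", len(agg), "aggregated rows")  # 3 -> 2
```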
5 Demonstration Case
In the implementation of the proposed architecture, the technologies previously shown in Fig. 3 were used. In the demonstration case, a traditional Data Warehouse containing more than 100 million records about flights in the USA [23] was used as the data source. Following the architecture, the available data was extracted from the traditional Data Warehouse and stored in HDFS, creating one file per table. After this process, the data was transferred to Hive, where the logical data model was identified and implemented according to the transformation rules specified in [13].
Once the Big Data Warehouse was loaded, Impala was used as the query engine,
allowing data analytics over the available data.
Tableau was used as the front-end tool, in which specific dashboards were implemented. As an example of the analytical tools that can be provided to the users, Fig. 7 shows a dual-axis plot with the number of flights per time interval (blue bars) and the average delay per flight in minutes (red line). In this plot, it is possible to see the hours with the most flights. It is also noticeable that the highest average delay, around 30 min, occurs at the end of the day, in the late afternoon, in the time interval [19:00–20:00[.
Another example of the analytical capabilities that can be provided to the users is
shown in the map of Fig. 8, where a color scale highlights the states with higher
incidence of flights, with Texas (TX) and California (CA) each having more than 11%
of the total number of flights.
Fig. 7. Number of flights and average delay per flight (Color figure online)
Fig. 8. Percentage of the number of flights per state
To verify whether the processing times are adequate for decision support tasks when using Tableau as the front end, the time needed to process several queries that integrate the implemented dashboards, already presented in Table 1, was analyzed. Once again, and although the architecture proposes Impala as the query engine, the processing times needed by Hive and by Impala were compared. The results are shown in Fig. 9.
Fig. 9. Benchmark of Hive and Impala through Tableau (time in seconds)
As can be seen in Fig. 9, Impala maintains the good performance shown before, when the queries were processed without Tableau as the front end, while Hive clearly decreases its performance, with processing times higher than 120 s, a scenario that is not satisfactory in an interactive analytical context.
6 Conclusions
This paper presented an architecture for implementing Big Data Warehouses, which uses HDFS as the staging area, Hive as the Data Warehouse, Impala as the query engine, and Tableau as the front-end analytical tool. For all extraction, loading, and transformation activities, Talend Open Studio for Big Data was used. Using a traditional Data Warehouse as the data source, the realization of a demonstration case allowed the integration of all the proposed components and technologies, providing an analytical environment in which Impala ensures satisfactory processing times.
In this architecture, and for the specification of the data requirements, particular
attention was given to the Hive data model, which was derived considering a multidimensional data model of a traditional Data Warehouse.
As future work, the architecture will be extended to consider real-time processing needs, both in feeding the Big Data Warehouse and in the interactive analysis and visualization of the data.
Acknowledgments. This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within the Project Scope:
UID/CEC/00319/2013, and by Portugal Incentive System for Research and Technological
Development, Project in co-promotion no. 002814/2015 (iFACTORY 2015–2018). Some of the
figures in this paper use icons made by Freepik, from www.flaticon.com.
References
1. Santhosh, B., Renjith, K.: Next generation data warehouse design with OLTP and OLAP
systems sharing same database. Int. J. Comput. Appl. 72(13), 45–50 (2013). doi:10.5120/
2. Dehdouh, K., Bentayeb, F., Boussaid, O., Kabachi, N.: Using the column oriented NoSQL
model for implementing big data warehouses. In: International Conference on Parallel and
Distributed Processing Techniques and Applications (PDPTA), Athens, 8–11 September (2015)
3. Chevalier, M., Malki, M.E., Kopliku, A., Teste, O., Tournier, R.: Implementing multidimensional data warehouses into NoSQL. In: The 17th International Conference on
Enterprise Information Systems (ICEIS), Barcelona, Spain (2015)
4. Krishnan, K.: Data Warehousing in the Age of Big Data, 1st edn. Morgan Kaufmann,
Elsevier Inc., Burlington (2013)
5. Aye, K.N., Thein, N.L.: A comparison of big data analytics approaches based on Hadoop
MapReduce. In: The 11th International Conference on Computer Applications, Yangon,
Myanmar (2013)
6. Khan, M.A.-U.-D., Uddin, M.F., Gupta, N.: Seven V’s of big data understanding big data to
extract value. In: Zone 1 Conference of the American Society for Engineering Education,
Bridgeport, CT, 3–5 April 2014
7. Cuzzocrea, A., Song, I., Davis, K.: Analytics over large-scale multidimensional data: the big
data revolution. In: The ACM 14th International Workshop on Data Warehousing and
OLAP, New York, USA
8. White, C.: Using big data for smarter decision making. IBM White Papers and Reports
9. Zikopoulos, P., Eaton, C., deRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data:
Analytics for Enterprise Class Hadoop and Streaming Data, 1st edn. McGraw-Hill,
New York City (2011)
10. Chevalier, M., El Malki, M., Kopliku, A., Teste, O., Tournier, R.: Implementation of
multidimensional databases with document-oriented NoSQL. In: Madria, S., Hara, T. (eds.)
DaWaK 2015. LNCS, vol. 9263, pp. 379–390. Springer, Heidelberg (2015). doi:10.1007/
11. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H.,
Murthy, R.: Hive – a petabyte scale data warehouse using Hadoop. In: IEEE 26th
International Conference on Data Engineering (ICDE), Long Beach, CA, 1–6 March 2010
12. Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Ching, C., Choi, A., Erickson, J.,
Grund, M., Hecht, D., Jacobs, M., Joshi, I., Kuff, L., Kumar, D., Leblang, A., Li, N., Pandis, I.,
Robinson, H., Rorke, D., Rus, S., Russell, J., Tsirogiannis, D., Wanderman-Milne, S.,
Yoder, M.: Impala: A Modern Open-Source SQL Engine for Hadoop. Cloudera (2015)
13. Santos, M.Y., Costa, C.: Data warehousing in big data: from multidimensional to tabular
data models. In: The International Conference on Computer Science & Software
Engineering, Porto, 20–22 July 2016
14. Dhawan, S., Rathee, S.: Big data analytics using Hadoop components like pig and hive. Am.
Int. J. Res. Sci. Technol. Eng. Math. 2(1), 88–93 (2013)
15. Dehdouh, K., Bentayeb, F., Boussaid, O., Kabachi, N.: Columnar NoSQL CUBE:
aggregation operator for columnar NoSQL data warehouse. In: The IEEE International
Conference on Systems, Man and Cybernetics (SMC), San Diego, CA, 5–8 October 2014
16. Santos, M.Y., Costa, C.: Data models in NoSQL databases for big data contexts. In: Tan, Y.,
Shi, Y. (eds.) DMBD 2016. LNCS, vol. 9714, pp. 475–485. Springer, Heidelberg (2016).
17. Vajk, T., Fehér, P., Fekete, K., Charaf, H.: Denormalizing data into schema-free
databases. In: IEEE 4th International Conference on the Cognitive Infocommunications
(CogInfoCom), Budapest, 2–5 December 2013
18. Di Tria, F., Lefons, E., Tangorra, F.: Design process for big data warehouses. In: The
International Conference on Data Science and Advanced Analytics (DSAA), Shanghai,
October 30–November 1 (2014)
19. Han, D., Stroulia, E.: A three-dimensional data model in HBase for large time-series dataset
analysis. In: The IEEE 6th International Workshop on the Maintenance and Evolution of
Service-Oriented and Cloud-Based Systems, Trento, Italy (2012)
20. Li, X., Zhou, W.: Performance comparison of Hive, Impala and Spark SQL. In: The
7th International Conference on Intelligent Human-Machine Systems and Cybernetics,
Hangzhou, China, 26–27 August 2015
21. Li, J.: Design of real-time data analysis system based on Impala. In: The Advanced Research
and Technology in Industry Applications, Ottawa, Canada, 29–30 September 2014
22. Kulkarni, K., Lu, X., Panda, D.K.: Characterizing Cloudera Impala workloads with
BigDataBench on InfiniBand clusters. In: The 7th Workshop on Big Data Benchmarks,
Performance, Optimization, and Emerging Hardware, USA (2016)
23. RITA-BTS: Bureau of Transportation Statistics, United States Department of Transportation. http://stat-computing.org/dataexpo/2009/the-data.html
Decision Support in EIS
The Reference Model for Cost Allocation
Optimization and Planning for Business
Informatics Management
Petr Doucek, Milos Maryska, and Lea Nedomova
Faculty of Informatics and Statistics, University of Economics,
W. Churchill sq. 4, Prague, Czech Republic
Abstract. The proposed conceptual model deals with two areas – Cost Allocation and Planning for Management of Business Informatics. This paper shows
some limitations of the model, its architecture – the individual layers of the
model, key principles of cost allocation on which the proposed model is based,
and factors which must be taken into account during the development and
subsequent implementation of the model. Practical experience with the model's implementation in business is discussed. In conclusion, several ideas for the future development of the reference model are presented.
Keywords: Performance management · Cost allocation · Profitability management · Business informatics
1 Introduction
The main aim of a company is to achieve its targets, and especially to achieve profit [1, 2]. Management requires exact information about the company's economic situation to be able to set paths to these aims [3, 4]. This information is presented by measures [5–8].
Measuring results and performance has a long tradition [9, 10]. Rapid development in this area has been visible especially in the last ten years [11], and this development can be split into two groups. The first group concerns the development of norms and processes governing how measurement should be carried out, and the second group concerns the development of tools for measuring results and performance with the support of information and communication technologies (ICT) [10].
The measurement of results and performance is usually covered by a Performance Measurement System (PMS), which helps organizations to achieve their goals and stay competitive by measuring and managing the efficiency and effectiveness of their actions [12], and which covers both of the above-mentioned groups. Organizations feel that a PMS contributes to their success, but they face problems implementing it [13].
This feeling is based on the necessity of information not only about the company as a whole, but also detailed information about each of the company's parts. These necessities are closely connected with the requirements of the company's owners on providing information
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 253–262, 2016.
DOI: 10.1007/978-3-319-49944-4_19
about the company's economic situation. One of the most important factors influencing the availability of information is the quality of Business Informatics.
One of the general approaches to creating and defining a PMS is Corporate Performance Management (CPM), a management concept that describes the processes, methods, metrics, and systems needed for the performance management of a company [14]. The main goal of CPM is to allow the measurement and management of the performance of a company or its parts, helping the company to achieve its goals. Corporate Performance Management is thus an umbrella term for the methodologies, metrics, processes, and systems used to monitor and manage the business performance of an enterprise.
Performance management is the process of collecting, analyzing, and/or reporting information relevant to the efficiency of an individual, group, organizational unit, system, or their parts [15]. It can also include the investigation of processes, strategies, technological procedures, and phenomena in order to identify whether their outputs are relevant to the organization's intentions and goals. The core of performance management is the production of added value in the chain "Data – Value – Metric – Measurement – Indicator – Information". This chain must function consistently in order not to produce distorted information.
Performance management can also be understood as part of the organization's complex management system [16]. In this context, performance management is a decision-making and executive process based on the obtained data and information, aimed at influencing the achieved outputs and results. For consistent performance management, the performance measurement system alone is not sufficient [17]. Consistency must also be ensured between measurement (what is measured and how) and management (how measurements are interpreted, what the goal is, and how the results are used).
This paper aims to present a conceptual reference model for cost allocation and planning for the efficient management of corporate informatics (REMONA). We have found that these areas of performance management in business informatics should be covered by ICT support, but such support is not available [3, 16, 18]. In further research, we found that our model should be based on the principles of Corporate Performance Management (CPM), a general approach to PMS. An inseparable part of the presentation is a scholarly discussion of the presented model, to obtain feedback and opinions on its design from the academic community and from end users. The REMONA model is proposed as part of an academic project of the Faculty of Informatics and Statistics at the University of Economics in Prague, in association with the companies Profinit, s. r. o. and AM-line.
The design of the model follows the Design Science Research approach [19], extended with additional scientific approaches:
• Case studies and qualitative research [20, 21] – we carried out seven business projects and consultations with Chief Financial Officers in different companies during 2010–2014, which helped us formulate the shortcomings of current solutions and the gaps felt by managers.
The Reference Model for Cost Allocation Optimization and Planning
• Analysis of the current state of the art, supported by the cited literature, and synthesis and deduction, which helped us identify and formulate the problem and the project goals,
• Design and development of the model,
• The final solution was tested by means of two case studies in selected international companies [22].
2 The REMONA Model
The proposed model identifies key dimensions and indicators and interconnects them within designed analytical cubes. REMONA is designed to be easily integrated into a company and easily configured, which enables it to be quickly tailored to the needs of a specific company.
The REMONA model is based on the principles of Corporate Performance Management and Business Intelligence. The aim of the model is to offer a solution to two
key corporate tasks, ‘cost allocation management’ and ‘planning’. This solution is
inextricably connected with the tasks of analyses and in particular, what-if analyses.
For both tasks the model comprises key ‘Dimensions’, ‘Metrics’, ‘Drivers’ and
‘Activities’, which are addressed as part of corporate informatics. Another requirement
for the model is the possibility of its rapid and easy adaptation to a specific company in
which it will be implemented. This is achieved in the case of the REMONA model by
its logic being implemented as much as possible through appropriate links between
data cubes and related dimensions.
To get the full picture we should add that in the case of specific companies or
specific allocation rules or analyses of profitability we are ready to make required
changes directly in the reference model (solution code) and add new findings to the
original model through system feedback.
The proposed model is based on basic prerequisites, limitations and requirements which must be fulfilled to ensure that REMONA can be easily and quickly implemented in a company. The model design is based on the following:
• The overall design of the model must be a general one so that it can be tailored to
the needs of a target organization.
• The proposed model must support easy and quick integration into corporate informatics.
• The model will be created in such a way that modifications can be made primarily
through configuration of the system, although it is possible that some functionality
may have to be developed to meet specific requirements.
• During the preparation of the model the necessary dimensions and key metrics must
be identified for tasks carried out in a given area.
When designing the model and subsequent implementation of the system it is
necessary to answer some key questions, which have to be taken into consideration as
they affect the preparation of the proposed model:
P. Doucek et al.
• What are the current and expected main problems in the economics and management of the development and operation of corporate informatics, and what are the priorities?
• Are any of the standard methodologies (ITIL, CobiT), or a proprietary methodology or model, used in the management of informatics?
• Is the management of corporate informatics based on the management of services
and service level agreements?
• What key metrics are required for the management of the economics of the system
for corporate informatics? Are any in use at present?
• Is there documentation of corporate informatics management and are there databases from which data can be obtained? Do they contain data that could be used for the design and population of the metrics and dimensions?
• Has an analysis of the maturity level of corporate informatics management processes been carried out, and what are its results?
• How high a level of detail will be necessary for analytical tasks in the management
of economics of corporate informatics?
• How are the costs of corporate informatics monitored?
• What is the place of cost analysis in corporate informatics management?
The proposed REMONA model is designed to permit easy and quick adaptation (modification) of the solution through parameters, according to the character of the answers to these questions, without the high costs of additional alterations.
2.1 General Overview
The authors address the REMONA model in the following areas:
• Business tasks (Profitability, Planning and Cost Allocations),
• Dimensions and Metrics,
• The REMONA application (software), which is addressed from the following views:
  • Architecture,
  • Data Model designed for the application and its Business Tasks,
  • Application Layer that contains the business logic of the solution defined in the Business Tasks (see above),
  • Reporting,
  • Deployment, describing all mandatory steps that have to be completed in order to use the model properly,
  • Initial parameters of services, etc.
The Business Tasks describe the issues that are solved with the model, while the dimensions and metrics support solving these issues from a multidimensional point of view. Business Tasks can be described as predefined processes that use the dimensions and metrics defined for the model; these dimensions and metrics can also serve as parameters of the REMONA model.
These three business tasks were selected because cost allocation, planning and profitability are growing in importance. The market and the economy are undergoing negative economic development. It is in such periods that managers demand accurate, detailed and up-to-date information not only about the company as a whole but also about its individual parts [23]. Key activities and goals according to [4, 24, 25] currently include:
• Every company tries to get maximum return on each investment and clearly
identify, and in many cases calculate, the benefits of investments.
• Companies try to minimize or eliminate activities and processes which do not
generate the required value.
• Companies struggle against changing economic conditions.
• Measuring and managing a company as a whole, and company informatics as one of its parts, is a closely monitored phenomenon.
• Proving that investments are warranted (for example, in ICT) and proving the
achievement of expected or required results.
The REMONA application can be described as software supplemented with analytical and operational manuals. The REMONA model contains not only the design of the data model but also a developed model with a user interface in the form of reporting. The REMONA model is designed with easy enhancement in mind: we expect that different companies will require different dimensions and metrics. Our design contains only general dimensions and metrics, and these can be supplemented by other users. For these reasons we selected IBM Cognos Express as a platform, because it provides the best environment for easy enhancement.
2.2 Business Architecture
The architecture can be described from several views, with different degrees of detail and different elements describing the model.
The basic view of the architecture is represented by the individual layers integrated in the model. These are the layers shown in Fig. 1:

Fig. 1. The REMONA model concept
• primary data sources,
• data integration – Data Stage (addressing questions of data pumps (ETL) and data quality),
• core of the data warehouse and data marts – partly addressed in REMONA,
• application layer and user interface layer (the object of REMONA),
• a metadata layer that passes through all the layers, which is of key importance for end users as it guarantees a standard language and description of all indicators and attributes that are part of REMONA and of the other layers of the company information system.
The architecture of the model shown in Fig. 2 is based on the traditional architecture of a BI solution, modified for the purposes of the model with the aim of allowing its integration into the architecture of an ordinary organization. The figure shows a detailed view of the individual components of the architecture described above as part of the data warehouse and application layers.
Fig. 2. REMONA business architecture (components: Business Process Management and Collaboration; Enterprise Performance Management Platform/Apps – “Manage the Organization”; Data Warehouse (CPM Model); Data Profiling and Data Quality; Extract, Transform, Load; Data Sources – company’s DWH and external data)
A layer of application tools (‘Application Platform’) contains the business logic of the proposed REMONA model, which reads data from a prefabricated data model.
2.2.1 Data Management Layer – Data Warehouse
The data model and the data warehouse developed according to it must cover the needs implied by the analytic tasks to be performed over it.
The conceptual model is divided into four compact sections, which are further detailed down to the level of the physical data model, implemented in the selected database technology. The conceptual model is divided into the areas “Finance and Management Accounting”, “Production Entity”, “Cost Allocation Entity” and “Other Entity” [26]. Each of these areas covers a specific field.
2.2.2 Application Layer
The application is the most important part of the REMONA model; it is the operating part of the whole model. The application layer contains the logic of the REMONA model. This logic is based on rules and processes that handle the three defined business tasks – profitability, planning and cost allocations. We have defined several multidimensional cubes for each of these business tasks. The multidimensional cubes allow the processing of the calculation steps defined in the processes for each of the business tasks.
We have divided the cubes into four groups:
• parametric (prefix RF – Rates and Factors),
• calculating (prefix Lu – Lookup),
• data cubes (prefix C – Cube),
• data cubes – historical (prefix CH – Cube History).
Each multidimensional cube contains one dimension with metrics and at least one standard dimension that is used for analyzing the data in the cube. We have defined relations among the cubes that enable data to be processed in online mode. Data are pre-processed during loading in batch mode, which makes the solution faster.
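The four cube groups and the idea of a relation between them can be sketched in a few lines (all cube contents, cost centers and rates below are invented toy examples; the real REMONA cubes live in IBM Cognos Express/TM1):

```python
# Minimal sketch of the four REMONA cube groups and one cube relation.
# All names and values are illustrative assumptions, not REMONA data.

# Parametric cube (prefix RF): allocation rates per cost center
rf_rates = {"IT-Ops": 0.6, "IT-Dev": 0.4}

# Data cube (prefix C): costs by the dimensions (cost center, month)
c_costs = {("IT-Ops", "2014-01"): 1000.0, ("IT-Dev", "2014-01"): 500.0}

# Calculating cube (prefix Lu): derived online through the relation
# between the parametric cube and the data cube
def lu_allocated_costs(costs, rates):
    """Apply the allocation rate of each cost center to its costs."""
    return {key: value * rates[key[0]] for key, value in costs.items()}

# Historical data cube (prefix CH): snapshot kept after batch loading
ch_costs_history = dict(c_costs)

print(lu_allocated_costs(c_costs, rf_rates))
# {('IT-Ops', '2014-01'): 600.0, ('IT-Dev', '2014-01'): 200.0}
```

The online/batch split of the text corresponds here to computing `lu_allocated_costs` on demand while `ch_costs_history` is filled once during loading.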
2.2.3 User Interface Layer – Reporting
In the model design, we expect the use of two main types of presentation and
analytic-presentation layers:
• Native tools of the selected environment, i.e., the tools Cognos Express/Cognos TM1,
• The tool Microsoft Excel.
The first group of tools provides full functionality, implied by the fact that these are native tools. As for Microsoft Excel, the functionality of the solution is provided by a plug-in module, which implements in Microsoft Excel the full functionality of the systems Cognos Express/Cognos TM1.
3 Results and Implementation Experience
The REMONA model was implemented in two international companies, the first one headquartered in the USA and the second one headquartered in Germany.
The target of the first implementation (in the USA) was to verify that the model can be implemented in a big company and to investigate what is missing in the model and what should be improved or changed – especially in the metrics and dimensions.
The REMONA model was adjusted according to the experience from the first implementation. The adjusted model was then implemented in a selected international company in Germany.
The first implementation was carried out over the MS SQL Server 2012 database, in which the data warehouse of the REMONA model was implemented. The second system used was IBM Cognos Express, with Microsoft Excel 2010 as the reporting tool. The solution was prepared over data covering one year of the company’s life.
Implementation was split into 5 phases:
• Identification of the reason for introducing the REMONA Model to the target
company, business requirements and setup of the project
• Analysis
• Draft data warehouse and ETL
• Implementation of ETL and the REMONA Model
• Testing and verification of the setup
Based on the results, the authors identified that one of the researched projects was currently loss-making. The identified causes were:
• the high price of accommodation for the employees present at the project site, which is not covered by the customer,
• a higher share of time spent at the site than agreed, where the planned presence of employees at the site was lower (by 40%),
• higher travel and board expenses, caused by the same reasons stated in the previous points.
The yield for the services provided, which corresponds to the agreed rate, is not stated among the causes. The low profitability, or loss as the case may be, is caused by badly calculated project costs and an inadequate price for the provision of the services.
The second implementation provides information about the effort that has to be invested in an implementation in a small company with a DWH containing 85 tables. The effort was 45 man-days, of which 22 man-days were the implementation of the standard solution (analysis of the target environment and processes, and customization) and 23 man-days were adjustments of REMONA (development of new functionality and new reports).
The savings from the second implementation were identified in the following areas:
• Monthly financial statements are available from all subsidiaries.
• Real-time reporting over all subsidiaries, for example from the points of view of travel costs, drawing of the project budget, employee workload, etc.
• Unified record-keeping processes without dependency on the human factor.
• Savings of 1.2 FTE (full-time equivalent) over all subsidiaries thanks to automated reporting.
All of these savings proved that the solution brings benefits to the companies using it, and the pilot implementations in both companies proved that implementation is possible and not overly difficult.
4 Conclusions
CPM is a very significant business activity and its importance keeps growing. It is an especially important activity from the point of view of the company.
Business economics management cannot succeed without adequate procedures. If this type of management is not successful, company top management cannot be expected to support it and to consider it one of the key departments keeping the entire business vital. If it is not possible to prove its benefits, company management will see it as a simple cost item that should be minimized as much as possible.
The model provides effects in several areas.
For business practice the REMONA model and its implementations proved that:
• The REMONA model is adjustable to the requirements of the company – configuration lies especially in the area of the ETL that reads data from the source systems (DWH) and loads it into REMONA, and in the code lists, dimensions, metrics, etc.
• The model takes into account current trends in the areas of Business Intelligence, Corporate Performance Management and reporting, as well as current best practices.
• Implementation of the REMONA model has effects on a firm’s results – faster and more accurate reporting, cost reduction, etc.
Acknowledgement. The paper was processed with the contribution of long-term institutional support of research activities by the Faculty of Informatics and Statistics, University of Economics, Prague. IP
References
1. Dopson, L.R.: Linking cost-volume-profit analysis with goal analysis in the curriculum using
spreadsheet applications. J. Hosp. Financ. Manag. 11(1), 104 (2003)
2. Halkos, G.E., Tzeremes, N.G.: International competitiveness in the ICT industry: evaluating
the performance of the top 50 companies. Glob. Econ. Rev. 36(2), 167–182 (2007)
3. Harindranath, G.: ICT in a transition economy: the case of hungary. J. Glob. Inf. Technol.
Manag. 11(4), 33–55 (2008)
4. Young, R.C.: Goals and goal-setting. J. Am. Inst. Planners 32(2), 76–85 (1966)
5. Ling, F.Y.Y., Peh, S.: Key performance indicators for measuring contractors’ performance.
Archit. Sci. Rev. 48(4), 357–365 (2005)
6. Martin, P.R., Patterson, J.W.: On measuring company performance within a supply chain.
Int. J. Prod. Res. 47(9), 2449–2460 (2009)
7. Sarker, B.R., Khan, M.: A comparison of existing grouping efficiency measures and a new
weighted grouping efficiency measure. IIE Trans. 33(1), 11–27 (2001)
8. Sauka, A.: Measuring the competitiveness of Latvian companies. Baltic J. Econ. 14(1–2),
140–158 (2014)
9. Chen, Y., Zhu, J.: Measuring information technology’s indirect impact on firm performance.
Inf. Technol. Manag. 5(1–2), 9–22 (2004)
10. Manning, R., White, H.: Measuring results in development: the role of impact evaluation in
agency-wide performance measurement systems. J. Dev. Effectiveness 6(4), 337–349 (2014)
11. Thompson, M., Walsham, G.: ICT research in Africa: need for a strategic developmental
focus. Inf. Technol. Dev. 16(2), 112–127 (2010)
12. Neely, A., Gregory, M., Platts, K.: Performance measurement system design: a literature
review and research agenda. Int. J. Oper. Prod. Manag. 15(4), 80–116 (1995)
13. Petera, P., Wagner, J., Menšík, M.: Strategic performance measurement systems implemented in the biggest Czech companies with focus on balanced scorecard - an empirical
study. J. Competitiveness 4(4), 67–85 (2012)
14. Gartner Inc.: Corporate Performance Management: BI Collides with ERP (2001). https://
15. Upadhaya, B., Munir, R., Blount, Y.: Association between performance measurement
systems and organisational effectiveness. Int. J. Oper. Prod. Manag. 34(7), 853–875 (2014)
16. Berry, A.J., Coad, A.F., Harris, E.P., Otley, D.T., Stringer, C.: Emerging themes in
management control: a review of recent literature. Br. Acc. Rev. 41(1), 2–20 (2009)
17. Flapper, S.D.P., Fortuin, L., Stoop, P.P.M.: Towards consistent performance management
systems. Int. J. Oper. Prod. Manag. 16(7), 27–37 (1996)
18. Voříšek, J., Pour, J., Buchalcevová, A.: Management of business informatics model –
principles and practices. E+M Ekonomie a Manag. 18(3), 160–173 (2015)
19. Hendl, J.: Kvalitativní výzkum. Portal, Praha (2016)
20. Maryska, M.: Referenční model optimalizace nákladove alokace a plánování pro řízení
podnikové informatiky (habilitační práce). VSE, Praha (2014)
21. Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S.: A design science research
methodology for information systems research. J. Manag. Inf. Syst. 24(3), 45–77 (2007)
22. Yin, R.K.: Case Study Research: Design and Methods. SAGE, Thousand Oaks (2009)
23. Král, B.: Manažerské účetnictví. Management Press (2012)
24. Dimon, R.: Enterprise Performance Management Done Right: An Operating System for
Your Organization. Wiley, Hoboken (2013)
25. Turban, E., Leidner, D., Mclean, E., Wetherbe, J.: Information Technology for Management:
Transforming Organizations in the Digital Economy. Wiley Student Edition (2012)
26. Maryska, M., Wagner, J.: Reference model of business informatics economics management.
J. Bus. Econ. Manag. 16(3), 621–637 (2015)
An Entropy Based Algorithm for Credit Scoring
Roberto Saia(B) and Salvatore Carta
Dipartimento di Matematica e Informatica, Università di Cagliari,
Via Ospedale 72, 09124 Cagliari, Italy
Abstract. The demand for effective credit scoring models has been rising in recent decades, due to the increase in consumer lending. Their objective is to divide loan applicants into two classes, reliable or unreliable, on the basis of the available information. Linear discriminant analysis is one of the most common techniques used to define these models, although this simple parametric statistical method does not overcome some problems, the most important of which is the imbalanced distribution of data by classes. This happens because the number of default cases is much smaller than that of non-default ones, a scenario that reduces the effectiveness of machine learning approaches, e.g., neural networks and random forests. The Difference in Maximum Entropy (DME) approach proposed in this paper leads to two interesting results: on the one hand, it evaluates new loan applications in terms of the maximum entropy difference between their features and those of the non-default past cases, using only these last cases for the model training and thereby overcoming the imbalanced learning issue; on the other hand, it operates proactively, overcoming the cold-start problem. Our model has been evaluated using two real-world datasets with an imbalanced distribution of data, comparing its performance to that of the best-performing state-of-the-art approach: random forests.
Keywords: Business intelligence · Credit scoring · Data mining
1 Introduction

The processes taken into account in this paper typically start with a loan application (from now on referred to as an instance) and end with the repayment (or non-repayment) of the loan. Although retail lending represents one of the most profitable sources of income for financial operators, the increase in loans is directly related to the increase in the number of defaulted cases, i.e., fully or partially unrepaid loans. In short, credit scoring is used to classify, on the basis of the available information, loan applicants into two classes, reliable or unreliable (or better, referring to their instances, accepted or rejected). Considering its capability to reduce monetary losses, it is clear that it represents an important tool, as stated in [1]. More formally, credit scoring techniques can be defined as a group of statistical methods used to infer the probability that an instance leads to a default [2,3].

© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 263–276, 2016.
DOI: 10.1007/978-3-319-49944-4_20
Since these processes involve all the factors that contribute to determining the credit risk [4] (i.e., the probability of loss from a debtor’s default), they allow financial operators to evaluate this aspect. Other advantages of these techniques are the reduction of the credit analysis cost, a quick response time in credit decisions, and the possibility to accurately monitor credit activities [5]. The design of effective approaches for credit scoring is not a simple task, due to a series of problems, the most important of which is the imbalanced distribution of the data [6] used to train the models (a small number of default cases, compared to the non-default ones), which reduces the effectiveness of machine learning strategies [7].
The idea behind this paper is to evaluate an instance in terms of its features’ entropy, and to define a metric able to determine its level of reliability on the basis of this criterion. In more detail, we measure the difference, in terms of maximum Shannon entropy (from now on referred to simply as entropy), between the same instance features, before and after adding the instance to evaluate to the set of non-default past instances. In information theory, the entropy gives a measure of the uncertainty of a random variable. The larger it is, the less a-priori information one has on its value; thus the entropy increases as the data become equally probable and decreases when their probabilities are unbalanced. It should be observed that, when all data have the same probability, we achieve the maximum uncertainty in the process of predicting future data.
On the basis of the previous considerations, we evaluate a new instance in terms of the uncertainty of its feature values, comparing the entropy of the set of non-default past instances before and after we add to it the instance to evaluate. A larger entropy indicates that the new instance contains similar data (increasing the level of equiprobability); otherwise, it contains different data, and thus represents a potential default case (in terms of non-similarity with the values of the non-default cases).
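The before/after entropy comparison can be illustrated on a single feature (a minimal Python sketch of the idea with toy discrete values; the actual DME classification relies on the maxima Λ and γ described in Sect. 4, and the data here are illustrative assumptions):

```python
import math

def entropy(values):
    """Shannon entropy of a list of discrete values."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_difference(non_default_values, new_value):
    """Entropy of a feature column before vs. after adding the new value."""
    before = entropy(non_default_values)
    after = entropy(non_default_values + [new_value])
    return after - before

# Values of one feature in the non-default past instances (toy data)
history = ["low", "low", "low", "medium"]

# A value already frequent in the history barely moves the entropy,
# while a previously unseen value changes it more markedly.
print(entropy_difference(history, "low"))
print(entropy_difference(history, "high"))
```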
In our approach we choose to operate without taking into account the default cases. This strategy presents a twofold advantage: first, we can operate proactively, i.e., without the need of default cases to train our model; second, we overcome the cold-start problem related to the scarcity (or total absence) of default cases, considering that in a real-world context they are much fewer than the non-default ones.
Given that in most of the cases reported in the literature [8–10] the Random Forests approach outperforms the others in this context, we compare the proposed approach only against it.
The main contributions of our work to the state of the art are listed below:
(i) calculation of the local maximum entropy by features (Λ), which gives us information about the entropy achieved by each feature in the set of non-default cases (it allows us to measure the differences between instances in terms of single features);
(ii) calculation of the global maximum entropy (γ), which represents a meta-feature based on the integral of the area under the curve of the Λ values (it allows us to measure the difference between instances in terms of all features);
(iii) formalization of the Difference in Maximum Entropy (DME) approach, used to classify the unevaluated instances as accepted or rejected, by exploiting the Λ and γ information;
(iv) evaluation of the DME approach on two real-world datasets, by comparing its performance with that of a state-of-the-art approach such as Random Forests (in our case, without using the past default cases to train the model).
The remainder of the paper is organized as follows: Sect. 2 discusses the
background and related work; Sect. 3 provides a formal notation and defines the
problem faced in this paper; Sect. 4 describes the implementation of the proposed
approach; Sect. 5 provides details on the experimental environment, the adopted
datasets and metrics, as well as on the used strategy and the experimental results;
some concluding remarks and future work are given in the last Sect. 6.
2 Background and Related Work
A large number of credit scoring classification techniques have been proposed in the literature [11], as well as many studies comparing their performance on several datasets, such as [8], where a large-scale benchmark of 41 classification methods was performed across eight credit scoring datasets.
The problem of how to choose the best classification approach and how to optimally tune its parameters was instead faced in [12]; the same work reports some useful observations about the canonical performance metrics used in this field [13].
2.1 Credit Scoring Models
Most of the statistical and data mining techniques at the state of the art can
be used in order to build credit scoring models [14,15], e.g., linear discriminant
models [16], logistic regression models [3], neural network models [17,18], genetic
programming models [19,20], k-nearest neighbor models [21], and decision tree
models [22,23].
These techniques can also be combined in order to create hybrid approaches to credit scoring, such as that proposed in [24,25], based on a two-stage hybrid modeling procedure with artificial neural networks and multivariate adaptive regression splines, or that presented in [26], based on neural networks and clustering.
2.2 Imbalanced Class Distribution
A complicating factor in the credit scoring process is the imbalanced class distribution of the data [7,27], caused by the fact that the default cases are much fewer than the non-default ones. Such a distribution of data reduces the performance of the classification techniques, as reported in the study made in [9].
The misclassification costs during the processes of scorecard construction and classification were studied in [28], where it is also proposed to preprocess the training dataset through an over-sampling or an under-sampling of the classes. The effect of this on performance has been deeply studied in [29,30].
2.3 Cold Start
The cold-start problem [31,32] arises when there is not enough information to train a reliable model for a domain [33–35].
In the credit scoring context, this happens when there are not many instances related to credit-worthy and non-credit-worthy customers [36,37]. Considering that, during the model definition, the proposed approach does not exploit the data about defaulting loans, it is able to reduce or overcome the aforementioned issue.
2.4 Random Forests
Random Forests are an ensemble learning method for classification and regression, based on the construction of a number of randomized decision trees during the training phase; conclusions are inferred by averaging the results. Since its formalization [38], it has been one of the most common techniques for data analysis, thanks to its better performance w.r.t. the other state-of-the-art techniques. This technique allows us to face a wide range of prediction problems without any complex configuration, since it only requires the tuning of two parameters: the number of trees and the number of attributes used to grow each tree.
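As an illustration of this two-parameter setup (a generic scikit-learn sketch on synthetic imbalanced data, not the experimental configuration of this paper), the two parameters correspond to `n_estimators` and `max_features`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class dataset (the rare class plays the
# role of the default cases)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The two parameters mentioned in the text:
#   n_estimators -> number of trees
#   max_features -> number of attributes used to grow each tree
clf = RandomForestClassifier(n_estimators=100, max_features=3,
                             random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

Note that plain accuracy can be misleading on such imbalanced data, which is exactly the issue the paper raises.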
2.5 Shannon Entropy
The Shannon entropy, formalized by Claude E. Shannon in [39], is one of the
most important metrics used in information theory. It reports the uncertainty
associated with a random variable, allowing us to evaluate the average minimum
number of bits needed to encode a string of symbols, based on their frequency.
More formally, given a set of values v ∈ V, the entropy H(V) is defined as shown in Eq. 1, where P(v) is the probability that the element v is present in the set V.

H(V) = − ∑_{v∈V} P(v) log2[P(v)]   (1)
For instance, if we have a symbol set V = {v1, v2, v3, v4, v5} where the symbol occurrences in terms of frequency are v1 = 0.5, v2 = 0.2, v3 = 0.1, v4 = 0.1, v5 = 0.1, the entropy H(V) (i.e., the average minimum number of bits needed to represent a symbol) is given by Eq. 2. Rounding up the result, we need 2 bits per symbol; so, to represent a sequence of five characters optimally, we need 10 bits.

H(V) = − ∑_{v∈V} P(v) log2[P(v)] ≈ 1.96   (2)
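The worked example can be checked directly; a few lines of Python reproduce the value of Eq. 2:

```python
import math

# Symbol frequencies of v1..v5 from the example
P = [0.5, 0.2, 0.1, 0.1, 0.1]

# H(V) = -sum over v of P(v) * log2(P(v))
H = -sum(p * math.log2(p) for p in P)

print(round(H, 2))       # 1.96 -> rounded up: 2 bits per symbol
print(math.ceil(H) * 5)  # 10 bits for a five-symbol sequence
```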
In the context of classification methods, the use of entropy-based metrics is typically restricted to feature selection [40–42], the process in which a subset of relevant features (variables, predictors) is selected and used for the definition of the classification model.
In this work, we instead use this metric to detect anomalous values in the features of a new instance, where anomalous stands for values different from those in the history of the non-default cases.
3 Notation and Problem Definition
This section introduces some notational conventions used in this paper and
defines the faced problem.
Given a set of classified instances T = {t1 , t2 , . . . , tN }, and a set of features
F = {f1 , f2 , . . . , fM } that compose each t, we denote as T+ ⊆ T the subset of
non-default instances, and as T− ⊆ T the subset of default ones.
We also denote as T̂ = {t̂1 , t̂2 , . . . , t̂U } a set of unclassified instances and
as E = {e1 , e2 , . . . , eU } these instances after the classification process, thus
|T̂ | = |E|.
Each instance can belong only to one class c ∈ C, where C = {accepted, rejected}.
3.1 Problem Definition
On the basis of the Λ and γ information (explained in Sect. 4.1), calculated
before and after we added to the set T+ the unclassified instances in the set T̂
(one by one), we classify each instance t̂ ∈ T̂ as accepted or rejected.
Given a function eval(t̂, λ, γ) created to evaluate the correctness of the t̂ classification made by exploiting the λ and γ information, which returns a boolean
value σ (0 = misclassification, 1 = correct classification), we formalize our
objective as the maximization of the sum of the results, as shown in Eq. 3.

max σ = Σ_{u=1}^{|T̂|} eval(t̂_u, λ, γ),  with 0 ≤ σ ≤ |T̂|        (3)
R. Saia and S. Carta
Our Approach
The implementation of our approach is carried out through the following three steps:
1. Local Maximum Entropy by Features: calculation of the local maximum
entropy by features Λ, aimed at obtaining information about the maximum level
of entropy assumed by each feature in the set T+ ;
2. Global Maximum Entropy: calculation of the global maximum entropy γ,
a meta-feature defined on the basis of the integral of the area under the curve of
the maximum entropy by features Λ;
3. Difference in Maximum Entropy: formalization of the Difference in Maximum Entropy (DME) algorithm, which classifies a new instance as accepted
or rejected on the basis of the Λ and γ information.
In the following, we provide a detailed description of each of these steps.
Local Maximum Entropy by Features
Denoting as H(f ) the entropy measured in the values assumed by a feature
f ∈ F in the set T+ , we define the set Λ as shown in Eq. 4. It contains the
maximum entropy achieved by each f ∈ F , so we have that |Λ| = |F |. We use
this information during the evaluation process explained in Sect. 4.3.
Λ = {λ1 = max(H(f1)), λ2 = max(H(f2)), . . . , λM = max(H(fM))}        (4)
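A minimal Python sketch of this step, assuming discrete feature values and entropy taken over each feature's empirical value distribution (the exact estimator is not spelled out in the text):

```python
import math
from collections import Counter

def feature_entropy(values):
    """Entropy of the empirical value distribution of a single feature."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def local_max_entropy(instances):
    """Lambda: one entropy value per feature, over the instance set T+."""
    return [feature_entropy(column) for column in zip(*instances)]

# Toy T+ with M = 2 features: the first is constant (entropy 0),
# the second is uniform over two values (entropy 1).
T_plus = [(1, "a"), (1, "b"), (1, "a"), (1, "b")]
lam = local_max_entropy(T_plus)
```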
Global Maximum Entropy
We denote as global maximum entropy γ the integral of the area under the curve of
the maximum entropy by features Λ (previously defined in Sect. 4.1), as shown
in Fig. 1.
More formally, the value of γ is calculated by using the trapezium rule, as
shown in Eq. 5.
Fig. 1. Global maximum entropy γ (entropy Λ plotted over the features F)
It is a meta-feature that gives us information about the maximum entropy
achieved by all the features in T+ , before and after we added to it an unevaluated instance. We use this information during the evaluation process (Sect. 4.3),
jointly with that given by Λ.
γ = ∫_{λ1}^{λM} f(x) dx ≈ (Δx/2) Σ_{k=1}^{M−1} (f(x_{k+1}) + f(x_k)),  with Δx = (λM − λ1)/(M − 1)        (5)
Difference in Maximum Entropy
The Difference in Maximum Entropy (DME) algorithm (Algorithm 1) aims to evaluate
and classify a set of unevaluated instances.
It takes as input a set T+ of non-default instances that occurred in the past and
a set T̂ of unevaluated instances, returning as output a set E containing all
instances in T̂ , classified as accepted or rejected on the basis of the Λ and γ information.
In step 2 we calculate the Λa value by using the non-default instances in T+ ,
as described in Sect. 4.1, while in step 3 we obtain the global maximum entropy
γa (Sect. 4.2). Steps 4 to 26 process all the instances t̂ ∈ T̂ .
After the calculation of the Λb and γb values (steps 5 and 6 ), performed by
adding the current instance t̂ to the non-default instances set T+ , in the steps
from 7 to 13, we compare each λa ∈ Λa with the corresponding feature λb ∈ Λb
Algorithm 1. Difference in Maximum Entropy (DME)
Input: T+ = set of non-default instances, T̂ = set of instances to evaluate
Output: E = set of classified instances
1: procedure InstancesEvaluation(T+ , T̂ )
2:    Λa = getLocalMaxEntropy(T+ )
3:    γa = getGlobalMaxEntropy(Λa )
4:    for each t̂ in T̂ do
5:       Λb = getLocalMaxEntropy(T+ + t̂)
6:       γb = getGlobalMaxEntropy(Λb )
7:       for each λa in Λa and corresponding λb in Λb do
8:          if λb > λa then
9:             b = b + 1
10:         else
11:            a = a + 1
12:         end if
13:      end for
14:      if γb > γa then
15:         b = b + 1
16:      else
17:         a = a + 1
18:      end if
19:      if b > a then
20:         E ← (t̂, rejected)
21:      else
22:         E ← (t̂, accepted)
23:      end if
24:      a = 0
25:      b = 0
26:   end for
27:   return E
28: end procedure
R. Saia and S. Carta
(steps from 8 to 12 ), counting how many times the value of λb is greater than
that of λa , increasing the value of b (step 9 ) when this happens, or that of a
otherwise (step 11 ); in the steps from 14 to 18 we perform the same operation,
but by taking into account the global maximum entropy γ.
At the end of the previous sub-processes, in the steps from 19 to 23 we
classify the current instance as accepted or rejected, on the basis of the a and
b values, then we set them to zero (steps 24 and 25 ). The resulting set E is
returned at the end of the entire process (step 27 ).
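Pulling the three steps together, a compact, self-contained Python sketch of Algorithm 1; discrete feature values and a unit-spaced trapezium rule are assumptions made here, not details taken from the paper:

```python
import math
from collections import Counter

def feature_entropy(values):
    """Entropy of the empirical value distribution of one feature."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def local_max_entropy(instances):
    """Lambda: one entropy value per feature (column-wise over instances)."""
    return [feature_entropy(col) for col in zip(*instances)]

def global_max_entropy(lam):
    """Gamma: trapezium-rule area under the Lambda curve (unit spacing)."""
    return sum((lam[k] + lam[k + 1]) / 2.0 for k in range(len(lam) - 1))

def dme(T_plus, T_hat):
    """Classify each unevaluated instance as 'accepted' or 'rejected' by
    comparing per-feature (Lambda) and global (gamma) maximum entropy
    before and after adding the instance to the non-default set T+."""
    E = []
    lam_a = local_max_entropy(T_plus)
    gamma_a = global_max_entropy(lam_a)
    for t in T_hat:
        lam_b = local_max_entropy(T_plus + [t])
        gamma_b = global_max_entropy(lam_b)
        a = b = 0
        for la, lb in zip(lam_a, lam_b):  # steps 7-13: per-feature comparison
            if lb > la:
                b += 1
            else:
                a += 1
        if gamma_b > gamma_a:             # steps 14-18: global comparison
            b += 1
        else:
            a += 1
        E.append((t, "rejected" if b > a else "accepted"))
    return E

# A conforming instance keeps entropy flat; an anomalous one raises it.
E = dme([(1, 10)] * 4, [(1, 10), (9, 99)])
```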
This section describes the experimental environment, the datasets and metrics used, the adopted strategy, and the results of the performed experiments.
Experimental Setup
The proposed approach was developed in Java, while the implementation of the
state-of-the-art approach used to evaluate its performance was made in R, using
the randomForest and ROCR packages.
The experiments have been performed by using two real-world datasets characterized by a strongly unbalanced distribution of data. For reproducibility of the
RF experiments, the R function set.seed() has been used to fix the seed of the
random number generator. The RF parameters have been tuned by searching for
those that maximize the performance.
It should be further added that we verified the existence of a statistical difference between the results, by using the independent-samples two-tailed Student’s
t-tests (p < 0.05).
The two real-world datasets used in the experiments (i.e., the Default of Credit Card
Clients dataset and the German Credit dataset, both available at the UCI Repository of Machine Learning Databases) represent two benchmarks in this research
field. In the following we provide a short description of their characteristics:
Default of Credit Card Clients (DC). It contains 30,000 instances: 23,364
of them are credit-worthy applicants (77.88%) and 6,636 are not credit-worthy
(22.12%). Each instance contains 23 attributes and a binary class variable
(accepted or rejected ).
German Credit (GC). It contains 1,000 instances: 700 of them are credit-worthy applicants (70.00%) and 300 are not credit-worthy (30.00%). Each
instance contains 21 attributes and a binary class variable (accepted or rejected).
This section presents the metrics used in the experiments.
Accuracy. The Accuracy metric reports the number of instances correctly classified, compared to the total number of them. More formally, given a set of
instances X to be classified, it is calculated as shown in Eq. 6, where |X| stands
for the total number of instances, and |X(+)| for the number of those correctly
classified.

Accuracy(X) = |X(+)| / |X|        (6)
F-Measure. The F-measure is the weighted average of the precision and recall
metrics. It is a widely used metric in the statistical analysis of binary classification, and gives us a value in the range [0, 1], where 0 represents the worst value
and 1 the best one. More formally, given two sets X and Y , where X denotes
the set of performed classifications of instances, and Y the set that contains the
actual classifications of them, this metric is defined as shown in Eq. 7.
F-measure(X, Y) = 2 · (precision(X, Y) · recall(X, Y)) / (precision(X, Y) + recall(X, Y))        (7)

where precision(X, Y) = |Y ∩ X| / |X| and recall(X, Y) = |Y ∩ X| / |Y|.
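These set-based definitions can be checked with a few lines of Python, where X is the set of instances the classifier labelled positive and Y the set of actually positive ones:

```python
def precision_recall_f(X, Y):
    """Precision, recall and F-measure from predicted (X) and actual (Y)
    positive sets, per the standard set-based definitions."""
    tp = len(X & Y)                      # |Y intersect X|
    precision = tp / len(X)
    recall = tp / len(Y)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```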
AUC. The Area Under the Receiver Operating Characteristic curve (AUC)
is a performance measure used to evaluate the effectiveness of a classification
model [43,44]. Its result is in a range [0, 1], where 1 indicates the best performance. More formally, according to the notation of Sect. 3, given the subset
of non-default instances T+ and the subset of default ones T− , the formalization
of the AUC metric is reported in Eq. 8, where Θ indicates all possible comparisons between the instances of the two subsets T+ and T− . It should be noted
that the result is obtained by averaging over these comparisons.
AUC = (1 / (|T+| · |T−|)) Σ_{t+∈T+} Σ_{t−∈T−} Θ(t+, t−),  with Θ(t+, t−) = 1 if t+ > t−, 0.5 if t+ = t−, 0 if t+ < t−        (8)
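Eq. 8 reduces to an average over all pairwise comparisons of the two score sets (scores here are hypothetical classifier outputs):

```python
def auc(pos, neg):
    """AUC as the average of Theta(t+, t-) over all pairs:
    1 if t+ > t-, 0.5 on ties, 0 otherwise."""
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))

score = auc([0.9, 0.8, 0.7], [0.6, 0.8])  # 4.5 of 6 pairs won
```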
The experiments have been performed by using the k-fold cross-validation criterion, with k = 10. Each dataset is randomly shuffled and then divided into k subsets;
each subset is used once as the test set, while the remaining k−1 subsets are used
as the training set, and the final result is the average over all folds. This approach
allows us to reduce the impact of data dependency and improves the reliability of
the results.
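A plain index-based sketch of this splitting scheme (the paper's actual implementation is not shown):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, split them into k folds, and yield one
    (train, test) index pair per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # fixed seed, as in set.seed()
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(20, 10))
```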
Experimental Results
As shown in Figs. 2 and 3, the performance of our DME approach is very similar
to that of RF, both in terms of Accuracy and in terms of F-measure, and we
achieve better performance than RF with the DC dataset.
By examining the obtained results, the first observation that arises is related
to the fact that our approach achieves the same performance as RF, despite
operating in a proactive manner (i.e., without using default cases during the
training process).
Another observation arises from the F-measure results, which show how
the effectiveness of our approach increases with the number of non-default
instances used in the training process (DC dataset). This does not happen with
RF, although it uses both default and non-default instances during the model
definition.
We can observe interesting results also in terms of AUC: this metric evaluates
the predictive capability of a classification model, and the results in Fig. 4 show
that our performance is similar to that of RF, although we did not train our
model with both classes of instances.
Fig. 2. Accuracy performance
Fig. 3. F-measure performance
Fig. 4. AUC performance
It should be noted that, as introduced in Sect. 1, the capability of the DME
approach to operate proactively allows us to reduce/overcome the cold-start
problem.
Conclusions and Future Work
Credit scoring techniques play a crucial role in many financial contexts
(e.g., personal loans, insurance policies), since they are used by financial
operators in order to evaluate the potential risks of lending, allowing them to
reduce the losses due to default.
This paper proposes a novel credit scoring approach that exploits an
entropy-based criterion to classify a new instance as accepted or rejected.
Considering that it does not need to be trained with the past default
instances, it is able to operate in a proactive manner, also reducing/overcoming
the cold-start and the data imbalance problems that reduce the effectiveness of
the canonical machine learning approaches.
The experimental results presented in Sect. 5.5 show two important aspects
related to our approach: on the one hand, it performs similarly to one of the best
performing approaches in the state of the art (i.e., RF ), by operating proactively;
on the other hand, it is able to outperform RF when a large number of
non-default instances are involved in the training process, whereas the
performance of RF does not improve further.
A possible follow-up of this paper could be a new series of experiments
aimed at improving the non-proactive state-of-the-art approaches, by adding
the information related to the default cases, as well as the evaluation of the
proposed approach in heterogeneous scenarios, which involve different types of
financial data, such as those generated by an electronic commerce environment.
Acknowledgments. This work is partially funded by Regione Sardegna under project
NOMAD (Next generation Open Mobile Apps Development), through PIA - Pacchetti
Integrati di Agevolazione “Industria Artigianato e Servizi” (annualità 2013).
References

1. Henley, W., et al.: Construction of a k-nearest-neighbour credit-scoring system.
IMA J. Manag. Math. 8(4), 305–321 (1997)
2. Mester, L.J., et al.: What's the point of credit scoring? Bus. Rev. 3, 3–16 (1997)
3. Henley, W.E.: Statistical aspects of credit scoring. Ph.D. thesis, Open University
4. Fensterstock, A.: Credit scoring and the next step. Bus. Credit 107(3), 46–49
5. Brill, J.: The importance of credit scoring models in improving cash flow and
collections. Bus. Credit 100(1), 16–17 (1998)
6. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newslett.
6(1), 20–29 (2004)
7. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell.
Data Anal. 6(5), 429–449 (2002)
8. Lessmann, S., Baesens, B., Seow, H., Thomas, L.C.: Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur. J. Oper.
Res. 247(1), 124–136 (2015)
9. Brown, I., Mues, C.: An experimental comparison of classification algorithms for
imbalanced credit scoring data sets. Expert Syst. Appl. 39(3), 3446–3453 (2012)
10. Bhattacharyya, S., Jha, S., Tharakunnel, K.K., Westland, J.C.: Data mining for
credit card fraud: a comparative study. Decis. Support Syst. 50(3), 602–613 (2011)
11. Doumpos, M., Zopounidis, C.: Credit scoring. In: Doumpos, M., Zopounidis, C.
(eds.) Multicriteria Analysis in Finance. Springer Briefs in Operations Research,
pp. 43–59. Springer, Heidelberg (2014)
12. Ali, S., Smith, K.A.: On learning algorithm selection for classification. Appl. Soft
Comput. 6(2), 119–138 (2006)
13. Hand, D.J.: Measuring classifier performance: a coherent alternative to the area
under the ROC curve. Mach. Learn. 77(1), 103–123 (2009)
14. Chen, S.Y., Liu, X.: The contribution of data mining to information science. J. Inf.
Sci. 30(6), 550–558 (2004)
15. Alborzi, M., Khanbabaei, M.: Using data mining and neural networks techniques
to propose a new hybrid customer behaviour analysis and credit scoring model in
banking services based on a developed RFM analysis method. IJBIS 23(1), 1–22
16. Reichert, A.K., Cho, C.C., Wagner, G.M.: An examination of the conceptual issues
involved in developing credit-scoring models. J. Bus. Econ. Stat. 1(2), 101–114
17. Desai, V.S., Crook, J.N., Overstreet, G.A.: A comparison of neural networks and
linear scoring models in the credit union environment. Eur. J. Oper. Res. 95(1),
24–37 (1996)
18. Blanco-Oliver, A., Pino-Mejı́as, R., Lara-Rubio, J., Rayo, S.: Credit scoring models
for the microfinance industry using neural networks: evidence from Peru. Expert
Syst. Appl. 40(1), 356–364 (2013)
19. Ong, C.S., Huang, J.J., Tzeng, G.H.: Building credit scoring models using genetic
programming. Expert Syst. Appl. 29(1), 41–47 (2005)
20. Chi, B., Hsu, C.: A hybrid approach to integrate genetic algorithm into dual scoring
model in enhancing the performance of credit scoring model. Expert Syst. Appl.
39(3), 2650–2661 (2012)
21. Henley, W., Hand, D.J.: A k-nearest-neighbour classifier for assessing consumer
credit risk. J. Roy. Stat. Soc. Ser. D (Stat.) 45, 77–95 (1996)
22. Davis, R., Edelman, D., Gammerman, A.: Machine-learning algorithms for credit-card applications. IMA J. Manag. Math. 4(1), 43–51 (1992)
23. Wang, G., Ma, J., Huang, L., Xu, K.: Two credit scoring models based on dual
strategy ensemble trees. Knowl.-Based Syst. 26, 61–68 (2012)
24. Lee, T.S., Chen, I.F.: A two-stage hybrid credit scoring model using artificial neural
networks and multivariate adaptive regression splines. Expert Syst. Appl. 28(4),
743–752 (2005)
25. Wang, G., Hao, J., Ma, J., Jiang, H.: A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38(1), 223–230 (2011)
26. Hsieh, N.C.: Hybrid mining approach in the design of credit scoring models. Expert
Syst. Appl. 28(4), 655–665 (2005)
27. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data
Eng. 21(9), 1263–1284 (2009)
28. Vinciotti, V., Hand, D.J.: Scorecard construction with unbalanced class sizes. J.
Iran. Stat. Soc. 2(2), 189–205 (2003)
29. Marqués, A.I., Garcı́a, V., Sánchez, J.S.: On the suitability of resampling techniques for the class imbalance problem in credit scoring. JORS 64(7), 1060–1070
30. Crone, S.F., Finlay, S.: Instance sampling in credit scoring: an empirical study of
sample size and balancing. Int. J. Forecast. 28(1), 224–238 (2012)
31. Zhu, J., Wang, H., Yao, T., Tsou, B.K.: Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In:
Scott, D., Uszkoreit, H. (eds.) COLING 2008, 22nd International Conference on
Computational Linguistics, Proceedings of the Conference, 18–22 August 2008,
Manchester, UK, pp. 1137–1144 (2008)
32. Donmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In: Kok,
J.N., Koronacki, J., Mantaras, R.L., Matwin, S., Mladenič, D., Skowron, A. (eds.)
ECML 2007. LNCS (LNAI), vol. 4701, pp. 116–127. Springer, Heidelberg (2007).
doi:10.1007/978-3-540-74958-5 14
33. Lika, B., Kolomvatsos, K., Hadjiefthymiades, S.: Facing the cold start problem in
recommender systems. Expert Syst. Appl. 41(4), 2065–2073 (2014)
34. Son, L.H.: Dealing with the new user cold-start problem in recommender systems:
a comparative review. Inf. Syst. 58, 87–104 (2016)
35. Fernández-Tobı́as, I., Tomeo, P., Cantador, I., Noia, T.D., Sciascio, E.D.: Accuracy
and diversity in cross-domain recommendations for cold-start users with positive-only feedback. In: Sen, S., Geyer, W., Freyne, J., Castells, P. (eds.) Proceedings
of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19
September 2016, pp. 119–122. ACM (2016)
36. Attenberg, J., Provost, F.J.: Inactive learning? difficulties employing active learning in practice. SIGKDD Explor. 12(2), 36–41 (2010)
37. Thanuja, V., Venkateswarlu, B., Anjaneyulu, G.: Applications of data mining in
customer relationship management. J. Comput. Math. Sci. 2(3), 399–580 (2011)
38. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
39. Shannon, C.E.: A mathematical theory of communication. Mob. Comput. Commun. Rev. 5(1), 3–55 (2001)
40. Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(1–4),
131–156 (1997)
41. Kwak, N., Choi, C.: Input feature selection for classification problems. IEEE Trans.
Neural Netw. 13(1), 143–159 (2002)
42. Jiang, F., Sui, Y., Zhou, L.: A relative decision entropy-based feature selection
approach. Pattern Recogn. 48(7), 2151–2163 (2015)
43. Powers, D.M.: Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation. Int. J. Mach. Learn. Technol. 2, 37–63 (2011)
44. Faraggi, D., Reiser, B.: Estimation of the area under the ROC curve. Stat. Med.
21(20), 3093–3106 (2002)
Visualizing IT Budget to Improve Stakeholder
Communication in the Decision
Making Process
Alexia Pacheco(&), Gustavo López, and Gabriela Marín-Raventós
Research Center for Communication and Information Technologies (CITIC),
University of Costa Rica (UCR), San José, Costa Rica
Abstract. Traditionally in large enterprises, budget cuts are a threat for IT
departments. One way to guard the IT budget is visualizing the impact of
such cuts on IT services. Data visualization tools are capable of bridging the gap
between increased data availability and human cognitive capabilities. In this
paper, we present a budget visualization tool that allows enterprise wide
data-driven decision-making. Our proposal was developed in the context of a
large multi-industry state-owned company, with rigid control structures and
external pressures for cost reduction and investment optimization. Our tool
promotes visualization as the main mechanism to justify IT budget requests and
to defend from budget amendments and cuts. We propose a generic tool that
might manage different perceptions from many parts of an organization. However, to evaluate our tool's effectiveness we incorporated four stakeholders'
perspectives: financial, technical, business clients, and supply chain. In our
efforts, we developed a data model that encapsulates these four perspectives and
improves communication capabilities between stakeholders.
Keywords: Enterprise systems · IT budget modeling · Data-driven decision-making · Data visualization
1 Introduction
Budget creation and allocation are an integral part of running any organization efficiently and effectively. Budgets not only serve as planning mechanisms but also as a
starting point for controlling programs within organizations.
The traditional approach to understanding budgets is: given a certain amount of
money, how much will be allocated to each of the required expenses? Budgets serve
both to determine how much to spend and to judge spending performances [1].
“A budget is a set of interlinked plans that quantitatively describe an entity’s projected
future operations” [2]. Budgeting typically begins with strategic planning at senior
management level, and lower level managers in the organization are asked to defend
their budgets assessing execution and possible budget cut impacts.
Most organizations are software- and data-driven nowadays. The amount of
information available is increasing rapidly, and data visualization tools are capable of
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 277–290, 2016.
DOI: 10.1007/978-3-319-49944-4_21
A. Pacheco et al.
bridging the gap between the daily increase in data and the organizational human
cognitive capabilities.
IT budget needs to be documented, so that its management becomes explicit and
visible to involved stakeholders. If IT budget is not properly documented, external
pressures could result in budget cuts that could consciously or unconsciously compromise IT business services.
In this paper, we present a budget visualization tool, which can be used to support
and improve budget negotiation, and to analyze the possible impacts on IT services,
caused by budget cuts. Moreover, the proposal allows parameterization of visualization
techniques to create an enterprise system that provides a more comprehensive view for
each stakeholder involved in the budgeting process.
The rest of this paper is structured as follows: Sect. 2, describes the theoretical
background for our research, including the IT budget definition and stakeholders' perspectives, and the technical regulations, standards and frameworks that affect IT budgeting in
contexts like the one we are working in. Finally, technical components used for our
system implementation are described. Section 3 describes the context in which our
system was developed and evaluated. Section 4 presents an overview of our budget
visualization tool, its components and architecture. Finally, in Sect. 5, we discuss the
results gathered from our system implementations and evaluation, present some final
remarks and describe future work.
2 Background
This section introduces the main concepts used in our research, including: IT budget,
regulations that somehow affect budgeting, and different stakeholder perspectives about
IT budgets. Moreover, we briefly describe the graph databases and visualization
techniques used to implement and evaluate our budget visualization system.
Information Technology Budget
IT budget is the amount of money spent on an organization’s IT professionals, systems
and services. Furthermore, IT budget includes the costs of maintaining and constructing
enterprise systems and supporting IT services [3]. An important difference between IT
budget and other traditional budgets is that not all IT spending falls within the IT
department (i.e., some of it is controlled by business divisions instead of IT).
IT budget contains compensations for IT professionals, both employees and
external consultants. Other common expenses are related to building and maintaining
back office systems (i.e., systems dedicated to running the company itself). For
instance, Enterprise Resource Planning (ERP), accounting, finance, human resources,
and in some cases even, Customer Relationship Management (CRM).
IT budget does not only encompass systems, but also the hardware in which those
systems run. Typically, an IT budget includes all sort of hardware that is required for
the company (e.g., laptops, servers, networking equipment, cloud services).
Usually, IT budgets are prepared by a centralized IT office. However, it is common
to find parallel IT related costs in business budgets or other departments such as sales
and marketing, research and development, operations, among others. This leads
organizations to consider several points of view while preparing and assessing IT
budgets. The next section describes the points of view considered for this research.
However, other perspectives exist and should be addressed to provide an all-embracing
view of IT budget [4].
IT Budget Perspectives
To develop this proposal, we considered different points of view [5]. To define the
perspectives related to IT budget, we identified a number of stakeholders who are
involved in the budgeting process and we collected their main concerns. The main
findings include:
• Customers of IT Business Services. These stakeholders are concerned with the
impact of budget cuts on IT business services. IT supports the execution of internal
business processes, and these stakeholders do not want negative effects on cost,
utility or even service warranties.
• Financial Managers. These stakeholders are concerned with the financial aspect of
IT services and their components. The main sub-processes concerning these
stakeholders are: budgeting (i.e., plan future IT expenditures), IT accounting (i.e.,
capital and operational cost management) and charging (i.e., assign costs of an IT
Service proportionally and fairly to the users of that service).
• IT Leaders. These stakeholders are primarily concerned with technical aspects of IT
business services, and how budget changes will affect their ability to support IT
services.
• Project Managers. They are concerned with the IT budget required for their projects,
and the possible impacts on their accomplishment if budget cuts are required.
As can be seen, several of the concerns are shared between stakeholders.
However, they have different ways to observe data in order to make decisions (i.e., they
can see the same data but derive different conclusions).
Technical Regulations, Standards and Frameworks
Government comptroller entities around the world recognize that information technologies have become an essential tool used to provide services [6]. However, in some
cases, state-owned companies question their own IT investments.
Technical regulations from government comptrollers have been published to optimize and monitor the financial resources invested in IT, to control these resources
effectively, and to observe standards and frameworks. These technical regulations
integrate practices of several reference models including ITIL [7], COBIT [8], TOGAF
[9, 10], and ISO 27000 [11].
All these sets of practices for IT management are used for regulatory purposes,
since they are supposed to optimize IT services. In order to standardize our enterprise
IT budget visualization system, we used terminology from these models.
IT Business Services are services that allow customers to achieve goals without the
ownership of specific costs and risks [7]. To support the provisioning of services to end
users, people (insourcing and outsourcing), processes and technology (hardware, software and infrastructure) are needed. In this paper, we present a system or tool to
visualize the impact of IT budget changes on IT Business services. This system is
considered to be critical for a big enterprise since business continuity is vital and
business change needs to be considered.
Business continuity ensures that the firm can continue to obtain value from its
products and services through such actions as process automation, product or service
development, service provisioning, among others. Business change delivers value
when some change in the business model, process or product/service is enabled or driven
through IT [12].
We used terminology from the TOGAF® framework metamodel. TOGAF metamodel provides a set of entities that can be captured, stored, filtered, queried and
represented. These characteristics allow consistency, completeness and traceability [9].
When properly applied, TOGAF metamodel allows companies to find the answer to
questions such as: Which functionalities support which applications? Which processes
will be impacted by which projects? These and other questions help estimate the budgeting impact on the company's IT services. Furthermore, entities have associated
relations and metadata that allow queries [9].
Up until now, we have been addressing entities, relations, core data and metadata.
All this information is normally stored in different sources across organizations.
However, a natural way to think of this information is a graph. Even though data
sources could be relational databases, digital documents, among others; the information
within an organization is so coupled, and usually there are so many missing bits, that a
graph is a reasonable solution to store it. The next subsection introduces Neo4j, a graph
oriented database management system (DBMS). We describe Neo4j since it was used
to implement our visualization system.
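As a toy illustration of why a graph fits this data, the following plain-Python sketch links budget items to components and IT business services and finds the services affected by a cut. All node names are invented for illustration; the authors' actual model lives in Neo4j:

```python
# Hypothetical dependency graph: budget items fund components, and
# components support IT business services.
EDGES = {
    "budget:outsourcing": ["component:erp-maintenance"],
    "budget:hardware": ["component:servers"],
    "component:erp-maintenance": ["service:billing"],
    "component:servers": ["service:billing", "service:email"],
}

def impacted_services(cut_item, edges=EDGES):
    """All IT business services reachable from a budget item under a cut."""
    stack, seen = [cut_item], set()
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(n for n in seen if n.startswith("service:"))

hit = impacted_services("budget:hardware")
```

In a graph DBMS the same reachability question becomes a short path query instead of a cascade of relational joins, which is the design motivation the authors give for choosing Neo4j.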
Neo4j: A Graph Oriented DBMS
Neo4j is a scalable graph database designed to support not only data storage but also
relations [13]. Neo4j stores and processes information in real time [14]. We decided to
implement with Neo4j due to the number of connections (i.e., the large number of join
operations) required for our application. The volume and variety of the data were also
considered [15]. It is not enough to store data and relations, it is also necessary to
access it in an efficient and effective way. Cypher is one of Neo4j’s query languages.
This language provides access to data stored in the DBMS through simple
queries, with a syntax similar to the Structured Query Language (SQL).
Finally, once the necessary data has been structured and stored, different ways to
visualize it are necessary to provide real value for the IT budgeting process. The next
section presents a brief introduction to visualization techniques.
Visualization Techniques
Data-Driven Documents (D3) is a novel representation-transparent approach to visualization. With it, designers selectively bind input data to arbitrary document elements,
applying dynamic transformations [16]. D3 provides a decoupled solution; therefore,
information can come from multiple sources and only requires a structuring step
in order to be displayed.
Fig. 1. Illustration of some of the available visualization techniques in D3.js [16], e.g., collapsible tree, circle packing, and hierarchical bars
Some visualization techniques presented in Fig. 1 can be used in the visualization
of hierarchical data and others for relationship data. Also, they will represent different
attributes of each piece of data. It is important to point out that most of these data-driven
visualization techniques are dynamic (i.e., users can interact with the visualization and
it will change the information displayed). All the concepts described in this section
were detailed because they are used in the context in which our budget visualization
tool was conceived. The next section describes such context.
3 Context
In this paper, a budget visualization system that emerged from a multi-industry
state-owned company, with more than ten thousand employees, is presented. The
company has an IT department and many local IT areas.
The company’s IT Department is in charge of assuring that IT activities and
investment complies with regulations and standards, and of supporting all three
management levels in their technological requirements. Figure 2 shows the company’s
structural diagram.
Fig. 2. Company's structural diagram: Industry 1 and Industry 2, each with a Local IT funded by that industry, alongside the company's IT Department
The company’s IT budget is significant compared with total company budget and
most of the expenses go to third parties providing services or products (e.g., outsourcing of software development and maintenance, outsourcing of solution operation,
solution and infrastructure providers). Rigid control structures are used to manage and
monitor incomes and expenses in the organization.
Furthermore, since it is a multi-industry company, information has been managed
in silos (i.e., each industry segment managed information separately). This practice is
also present inside each management department.
Efforts to comply with government regulations for IT started several years ago, and
IT-related practices have different levels of implementation, most of them based on
ITIL and COBIT. These efforts have driven IT budgeting practices aimed at creating
a holistic vision of the IT budget.
The IT budget formulation is performed annually in a distributed way (more than
fifty department managers are involved in this process). Each Local IT and the IT
Department formulate the IT budget, and that information is registered in the company
budget system. Mandated by the government comptroller, the IT department must attach an
IT plan explaining the IT budget. Given this requirement, a format and tool for collecting detailed information about the IT budget were defined and implemented two years ago.
A major disruption occurred when one of the industries in which the company
operates moved from a monopolistic market to a competitive one. This led to
several strategic competitive efforts and large investments to optimize budgeting and
investment. With these context characteristics in mind, our project tries to answer the
following research questions:
1. How can the IT department convince senior management and other stakeholders
that IT budgets must sometimes exist?
2. How can the impacts of IT budget cuts be shown to management in order to avoid
them or at least manage them wisely?
During the first assessments to answer these questions, we realized that data is the
key factor. However, it is not the data itself that can change managerial decisions, but
how the data is shown to each stakeholder.
Our visualization tool allows dynamic change of data and visualization techniques
to support the IT budgeting process. Moreover, the possibility to visualize budget information makes the comprehensiveness of IT visible across the organization. To achieve this, we developed a solution that supports data-driven documents as a visualization model. The data currently used for budget
decision-making was extracted and transformed to meet the requirements of a graph
database, and a set of queries was established to determine the applicability of our approach.
4 System Overview
Our budget visualization system provides a mechanism to picture corporate information through different visualizations. We took the available data sources and created an
Extraction, Transformation and Loading (ETL) process that uses a graph-oriented
database (Neo4j as the DBMS) as its repository. With the data loaded into the new DBMS, users
can select both the data and the visualization technique they want to use. Finally, a
view assembler prepares the query results to be displayed by a dynamic data-driven document.
The visualization technique must be coherent with the data that will be shown
(i.e., a hierarchical display may not be suitable for visualizing disjoint information). The goal
of this solution is to enable data-driven decision-making in the IT
budgeting process. Figure 3 shows a component diagram of our proposed system.
Further subsections describe each of its components in detail.
Fig. 3. Budget visualization system component diagram
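The flow just described can be sketched roughly in code: ETL into a repository, a parametric query, and a view assembler that shapes results for a data-driven document. All names, the dict-based in-memory repository, and the sample figures below are our illustrative assumptions, not the system's actual implementation:

```python
# Sketch of the three-stage flow: ETL -> repository -> query -> view assembler.
# An in-memory list stands in for the Neo4j repository.

def etl(sources):
    """Extract rows from heterogeneous sources and load them as uniform records."""
    repo = []
    for src in sources:
        for service, amount in src:
            repo.append({"service": service, "amount": amount})
    return repo

def query(repo, service):
    """Parametric query: budget entries for one IT business service."""
    return [r for r in repo if r["service"] == service]

def assemble_view(results):
    """View assembler: shape query results for a data-driven document."""
    return {"name": "budget",
            "children": [{"name": r["service"], "value": r["amount"]} for r in results]}

# Hypothetical sources: two systems, each a list of (service, amount) rows.
repo = etl([[("Email", 100), ("ERP", 250)], [("Email", 40)]])
view = assemble_view(query(repo, "Email"))
print(view["children"])
```

The point of the decomposition is that each stage can change independently, mirroring the component diagram in Fig. 3.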
Data Sources
To identify relevant data sources, we interviewed different stakeholders (members of
the company described in the context section). The main data sources identified were
the company budget system, the IT service catalog, the IT federated planning system,
and the procurement initiatives plan; their formats were relational databases, Excel
sheets, and Word documents. To document the IT budget, it is necessary to consider the stakeholders involved in IT budget management and their typical concerns. These stakeholders include, but are not limited to:
customers of IT business services, financial managers, IT leaders, and project managers.
Since the implementation of our visualization system required an ETL process, we
decided to adapt the transformed data in order to align it with ITIL concepts and
TOGAF metamodel.
ETL Process
Data sources were manually identified. For each data source, the information had to be
analyzed in order to transform it and load it into the new repository. We created a data
model that allows mapping the identified data sources into a new repository.
Figure 4 shows our proposed data model. It contains components derived from
ITIL and TOGAF. The core concept (i.e., the one that will drive most of the queries) is
IT Business Service. The data stored in this component is in business language. The
proposed data model has four domains: business, financial, technical and supply chain.
The business domain includes strategic business goals, business drivers, IT strategic
lines and priorities, IT business services, organizational units, and IT service categories
(i.e., an index). Work packages are a generic component that encapsulates programs,
projects, or other types of work.
Technical contribution to business can be distinguished through service delivery.
This is why the main relation between the business domain and the supply chain and
technical domain is “requires”.
The technical domain encapsulates technical services and components. These items are
essential to deliver business services. The supply chain domain represents purchases from third
parties and providers: purchase initiatives (defined during budget negotiation) and
purchases (carried out during the budget execution period). Both purchases and initiatives
are modeled through technical budget components.
The financial domain includes expense categories used to classify all the technical
budget components that are mapped to purchases and initiatives. Our data model
establishes the roadmap that connects business strategies to financial resources and IT components.
All the information extracted from the data sources is transformed to align it
with our data model and loaded into the Neo4j database. This approach eases the
evolution of our model: the incorporation or removal of components does not affect
data integrity. Therefore, as soon as fundamental concepts related to those mapped in
our data model are identified, they can be incorporated.
Fig. 4. Data model defined to map identified data sources into one graph oriented database
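The loading step can be pictured with a toy in-memory property graph; in the actual system this is done with Cypher against Neo4j, and the node identifiers and relationship names below are invented for illustration:

```python
# Toy property graph standing in for Neo4j: nodes carry a domain tag,
# edges carry a relationship type (the model's main relation is "requires").

nodes, edges = {}, []

def add_node(node_id, domain, label):
    nodes[node_id] = {"domain": domain, "label": label}

def add_edge(src, rel, dst):
    edges.append((src, rel, dst))

# Hypothetical instances of the model's components, one per domain.
add_node("svc:email", "business", "IT Business Service")
add_node("tech:mailserver", "technical", "Technical Component")
add_node("fin:sw-licenses", "financial", "Expense Category")

# Business services require technical items; technical budget components
# are classified under financial expense categories.
add_edge("svc:email", "REQUIRES", "tech:mailserver")
add_edge("tech:mailserver", "CLASSIFIED_AS", "fin:sw-licenses")

# Adding or removing a component never rewrites existing records,
# which is what eases the evolution of the model.
print(len(nodes), len(edges))
```

In a graph database this additive style of loading is what keeps data integrity intact as the model evolves.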
Query Engine
Our system is built to allow parametric queries. Users can filter and adapt the query to
retrieve data concerning a specific element in our data model or to aggregate information for more general purposes.
Even though our system allows the traditional CRUD (create, retrieve, update and
delete) operations, we designed 14 specific queries. The implemented queries were
designed after observations of the negotiation process carried out by several stakeholders in the IT budgeting process. Table 1 shows the implemented queries, and the
stakeholders that might be interested in the information results of those queries.
Table 1. Stakeholders' common concerns. Stakeholders are: (1) customers of IT business
services, (2) financial managers, (3) IT leaders, (4) project managers
• IT Business Services for a given service level
• IT Business Services impacted by a given initiative
• Budget of IT components and technical services
• IT components and technical services required for an IT Business Service
• Purchase initiatives that generate CAPEX
• Purchase initiatives that generate OPEX
• CAPEX amount
• OPEX amount
• Expense categories impacted by a given initiative
• IT budget according to general ledger
• IT budget for each IT component or service associated with a given expense category
• IT budget for each IT component or technical service
• IT budget for each work package
• Work packages impacted by a given purchase initiative
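One of the queries above ("IT Business Services impacted by a given initiative") can be sketched over a toy edge list. The relationship names and identifiers are our assumptions for illustration, not the system's actual Cypher schema:

```python
# Parametric query sketch: which business services depend on components
# funded by a given purchase initiative?

# Hypothetical graph edges: (source, relationship, target).
edges = [
    ("PI-15", "FUNDS", "tech:licenses"),
    ("svc:email", "REQUIRES", "tech:licenses"),
    ("svc:erp", "REQUIRES", "tech:erp-host"),
]

def services_impacted_by(initiative):
    """Follow initiative -FUNDS-> component <-REQUIRES- service."""
    funded = {dst for src, rel, dst in edges if src == initiative and rel == "FUNDS"}
    return sorted(src for src, rel, dst in edges if rel == "REQUIRES" and dst in funded)

print(services_impacted_by("PI-15"))  # ['svc:email']
```

In Neo4j the same traversal would be a single parameterized Cypher `MATCH` pattern; the parameter (here, the initiative identifier) is what makes the query reusable by different stakeholders.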
Visualization Techniques and View Assembler
Our proposal provides a set of visualization techniques, but it also allows anyone to add
more techniques as long as the visualization fits the data structures. The ETL process
modifies data in order to allow both generic and highly specific (ad-hoc) visualizations.
All of these are based on the data model proposed.
To test our proposal, we used some of the available data-driven documents [16].
Figure 5 shows examples of query results from our system implementation.
Functional testing showed that once the data is loaded into the data-driven document, the
interaction is easy and useful for gaining insight into the data being visualized. Moreover,
functional testing demonstrated that the same information can be visualized with different visualization techniques; therefore, our system can serve different stakeholders, adapting to their requirements. To illustrate the tool's functionality, Fig. 5
presents the IT budget according to the general ledger. All figures presented in this section
are screenshots from web browsers displaying dynamic charts.
The visualizations presented in Fig. 5 allow financial experts to determine, in one
picture, which industry has more expenses, and in which expense category each
industry-owned company is prominent. In a budget negotiation session, the circle
packing visualization helps to understand the big picture, while the hierarchical bar
charts can be used to answer questions or to provide a detailed explanation of expenses
and budget perspectives.
Notice that the data visualized in Fig. 5 is the same in all three visualizations. They
are provided because multiple ways to visualize the same data can foster understanding
between different stakeholders. Our system allows switching between visualizations
with a couple of clicks.
Figure 6 shows all four perspectives of one IT business service and supports
understanding among all stakeholders. The leftmost lines represent the business
perspective, classifying the industry and an IT business service. The center-left lines
(PIs, purchase initiatives) represent the supply chain domain and its details (PI-L,
components of purchase initiatives). The technological components (technical domain)
and general expense categories (financial domain) are represented in the rightmost
lines, respectively.
Fig. 5. IT budget according to general ledger. Hierarchical bar charts show the data (left:
high-level information; right: drill-down). The bottom image represents the same data using the
circle packing visualization. Hovering over any component shows a tooltip.
Fig. 6. Sankey diagram of budgeting information for one IT business service. The zoom in the
top-right corner shows the tooltip describing the specific amount of the gray connection.
In the Sankey diagram, connection thickness represents the strength of the connection (e.g., the amount of money). The dynamic aspect of this visualization uses
tooltips to show further details on the connections or lines, depending on where the
user places the cursor.
Figure 6 can be used to determine which services will be impacted if there is a
budget cut. For instance, if senior management suggests cutting purchase initiative 15,
an immediate mapping can be made to SW Licenses, which also affects SW&HW. In this
way, stakeholders can determine how a budget cut will affect IT's capacity to maintain
business services or other IT-related tasks.
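The mapping from budget records to Sankey links, where connection thickness encodes the amount of money, can be sketched as a simple aggregation; the field names and amounts below are illustrative, not the paper's schema:

```python
# Sketch: deriving Sankey links from budget records; link "value" is the
# summed amount, which the renderer draws as connection thickness.

# Hypothetical budget records flowing service -> initiative -> expense category.
records = [
    {"from": "Service A", "to": "PI 15", "amount": 300},
    {"from": "PI 15", "to": "SW Licenses", "amount": 200},
    {"from": "PI 15", "to": "SW Licenses", "amount": 100},
]

def sankey_links(records):
    """Aggregate amounts per (source, target) pair."""
    links = {}
    for r in records:
        key = (r["from"], r["to"])
        links[key] = links.get(key, 0) + r["amount"]
    return [{"source": s, "target": t, "value": v}
            for (s, t), v in sorted(links.items())]

for link in sankey_links(records):
    print(link)
```

A D3 Sankey layout consumes exactly this kind of source/target/value link list, so the view assembler only has to emit it.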
5 Discussion
In this paper we presented a budget visualization system that extracts and transforms
an organization's distributed data and allows multiple ways to visualize it. We implemented a general, enterprise-wide visualization tool (i.e., usable by many stakeholders);
however, four perspectives were tested.
Our solution provides mechanisms to incorporate new visualizations and data
sources (technical abilities are required to achieve it). While designing our system we
realized that stakeholders are very focused on their ways to visualize information, and
they do not always care about other perspectives. A design effort was conducted to
assure that the same information can be accessed and understood by multiple stakeholders, either in a negotiation process or simply to analyze corporate budgets. This
effort allowed the creation of an integrated data model (see Fig. 4) that eases communication between stakeholders.
To boost not only the use, but the acceptance and proliferation of our system in the
organization, we considered multiple agents of each area represented. We explained
why data should be combinable. Moreover, we presented the benefits that each of them
could gain from a common understanding and shared body of information (i.e.,
common language).
It is crucial for the maintainability and sustainability of our system that someone
within the organization keeps track of the big picture. This person, or group of people,
must not only assure a continuous information supply but also maintain data relations
and information coherence.
As stated, our system was developed in the context of a large multi-industry
state-owned company with rigid control structures and segregated information (due to
a silo mentality). We believe that our tool is an example of collaboration that could be
used as a reference to boost collaborative practices in such contexts.
Our system uses multiple layers. This provides it with the flexibility that it requires
for adapting to different contexts. Moreover, it allows the incorporation of emerging
visualizations and data sources. We believe that our solution is applicable in contexts
other than budgeting. However, further analysis is required to assess its applicability.
We tested our system with real data from a real organization, and demonstrated that
our solution fits complex scenarios, and provides stakeholders with new and adaptable
ways to conduct their tasks.
In the context of a public institution, our budget visualization tool enables, improves,
and promotes data-driven decision-making. It supports proper budget-related decisions and enhances transparency both within the organization and towards the comptroller.
A natural path for this research is to include technical debt in our data model.
Most of our data model components allow a direct mapping to technical debt, since
they involve cost management of purchase initiatives that could also encompass outsourcing contracts or even software development. As Guo, Spínola and Seaman [17]
define it, technical debt can be represented as a list of delayed tasks that may cause
maintenance problems in the future. The authors state that technical debt must be quantifiable. We believe it must not only be quantifiable, but also displayable, contextualizable, linkable and relatable to the other components of our data model.
To operationalize information integration, several factors are required, including
but not limited to:
• A semantic model that unifies concepts even if stakeholders use different words for
the same notion. This semantic model could match the core concepts in order to
allow a holistic view of the company’s information.
• A standard (syntactic) format, the product of the ETL process, that wraps data sources
and decouples them from specific implementations of data repositories.
• Well-defined protocols to supply and update the information contained in the new repository.
This research concluded the preliminary evaluation of a budget visualization system implemented to be deployed in a large multi-industry state-owned company.
Future work includes expanding the visualization tool to domains other than IT
budgeting, and assessing the viability of deploying it in other companies with
similar conditions.
Acknowledgment. This work was partially supported by Research Center for Communication
and Information Technologies (CITIC) at University of Costa Rica. Grant No. 834-B4-412.
References
1. Bragg, S.M.: Budgeting: A Comprehensive Guide. Accounting Tools (2014)
2. Bowen, M., Morara, M., Mureithi, S.: Management of business challenges among small and
micro enterprises in Nairobi-Kenya. KCA J. Bus. Manag. 2 (2009)
3. Weill, P., Aral, S.: Generating premium returns on your IT investments. Sloan Manage. Rev.
47, 39–48 (2006)
4. Organisation for Economic Cooperation and Development: OECD Best Practices for Budget
Transparency. http://www.oecd.org/gov/budgeting/best-practices-budget-transparency.htm
5. Gartner: Manage four views of the IT Budget (2012)
6. ESTEP: State-Owned Enterprises in the European Union: ensuring level playing field (2013)
7. Long, J.O.: ITIL® 2011 At a Glance. Springer, New York (2012)
8. ISACA: What is COBIT 5? http://www.isaca.org/COBIT
9. The Open Group: TOGAF 9.1, Chapter 34: Content Metamodel
10. The Open Group: TOGAF®. http://www.opengroup.org/subjectareas/enterprise/togaf
11. ISO: ISO/IEC 27000. http://www.iso.org/iso/home.html
12. Curley, M.: Introducing an IT capability maturity framework. In: Filipe, J., Cordeiro, J.,
Cardoso, J. (eds.) ICEIS 2007. LNBIP, vol. 12, pp. 63–78. Springer, Heidelberg (2008).
13. Neo4j: Neo4j: The World's Leading Graph Database. http://neo4j.com/
14. Holzschuher, F.: Performance of graph query languages: comparison of Cypher, Gremlin
and native access in Neo4j, pp. 195–204 (2013)
15. Hunger, M., Boyd, R., Lyon, W.: The Definitive Guide to Graph Databases for the RDBMS
Developer. Neo Technology (2016)
16. Bostock, M.: D3 Data-Driven Documents. https://d3js.org/
17. Guo, Y., Spínola, R.O., Seaman, C.: Exploring the costs of technical debt management –
a case study. Empir. Softw. Eng. 21, 159–182 (2016)
Implementing an Event-Driven Enterprise
Information Systems Architecture: Adoption
Factors in the Example of a Micro Lending
Case Study
Kavish Sookoo, Jean-Paul Van Belle(&), and Lisa Seymour
Department of Information Systems,
University of Cape Town, Cape Town, South Africa
[email protected]
Abstract. Event-driven architecture (EDA) is an architectural approach for
enabling communication between distributed enterprise information systems.
EDA enables organisations to be adaptable, flexible and robust in the management of business processes and ultimately to achieve agility. This paper
reviews definitions, concepts and adoption criteria for an EDA, and a case study
investigates the adoption of EDA in a micro lending organisation, exploring
the technological, organisational and environmental adoption factors.
Keywords: Event-driven architecture · Distributed EIS · SOA · IS adoption
1 Introduction
Organisations need to be agile in order to accommodate changes to environmental
conditions and customer demands [1]. To achieve agility, organisations are obligated to
react to opportunities and pressures in order to continuously monitor and optimise
business processes. In order to do this, organisations are dependent on the underlying
information systems to make decisions pertaining to these business processes [2].
In the development of enterprise-wide information systems, communication, both
internal and external to the organisation, is required. This communication is not only
between information systems but within the components of the information system [3].
To facilitate this communication, information systems are reliant on events that are
responsible for the triggering of business processes. Event-driven architecture (EDA) is
an emerging information systems paradigm and architectural approach that facilitates
the communication between disparate information systems [4].
The research problem is that little is known about the drivers and issues relating to
the implementation of EDA in a real-world context. The goal of this research is to
provide a deeper insight into the adoption of an EDA, and particularly relating to the
Technological, Organisational and Environmental (TOE) factors which can drive the
organisational adoption of EDA. This is achieved by exploring the adoption of an EDA
within the context of a Micro Lending Organisation (MLO) in the form of a case study.
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 293–308, 2016.
DOI: 10.1007/978-3-319-49944-4_22
K. Sookoo et al.
2 Literature Review
The event-driven paradigm represents the real-time occurrence of an event within a
domain. This event could be any internal or external activity performed, and it can
initiate a single action or multiple actions. Business events are changes in state that
affect the organisation's business processes: this state could be a customer interacting
with a product, a change in the operational environment, or a reaction to
competitors [5].
EDA is a methodology that allows for the production, detection, consumption and
reaction to events. These events are propagated in real time to all event generators and
event consumers. Multiple event consumers may respond to an event initiated by an
event generator. Events are transmitted between information systems, through decoupled service components [6]. Complex event processing (CEP) involves the evaluation
of a number of combined events and performing an action on these events. These
events may be related causally, temporally or spatially [7]. A common usage scenario
for CEP is to respond to business anomalies, threats or opportunities.
The design of an EDA relies on the publish-subscribe pattern to enable the
decoupling of components. An EDA allows asynchronous communication of events to
coordinate the analysis of multiple data streams in real time [8]. EDA allows the
organisation to respond in real-time to events without limitations imposed by technology. These events deliver information constantly, promoting information sharing,
preventing information hiding and allowing information to be acted upon by relevant
business stakeholders [9].
Why Adopt an EDA?
The use of real-time systems has increased with the advent of mobile computing [10].
Event-driven design using the publish-subscribe pattern allows for communication in
the form of "push" messages [11]. This removes the need for polling of events, shifting
control from lower-powered computational devices to large event-processing systems. This
inversion of control shortens the processing of events and
increases the real-time responsiveness of systems [9].
An important aspect of EDA is decoupling, which allows generators and consumers of events to operate independently: generators can produce events while consumers are disconnected, and consumers in turn can handle events while publishers are
inactive [12]. By employing a decoupled communication mechanism, EDA
allows for greater flexibility and ease of configuration and reconfiguration [13].
The asynchronous model of an EDA allows event generators and event consumers
to publish or consume events without waiting for a response, so generators and
consumers need not wait for one another [12].
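The asynchronous model can be sketched with a queue between a generator and a consumer; this asyncio fragment is our illustration of the principle, not the MLO's implementation:

```python
# Sketch of asynchronous event flow: the generator publishes without waiting
# for a response, and the consumer drains the queue at its own pace.

import asyncio

async def generator(queue):
    for i in range(3):
        await queue.put({"event": i})  # returns immediately; no response awaited
    await queue.put(None)              # sentinel: no more events

async def consumer(queue, out):
    while (event := await queue.get()) is not None:
        out.append(event["event"])

async def main():
    queue, out = asyncio.Queue(), []
    # Both coroutines run concurrently; neither blocks the other.
    await asyncio.gather(generator(queue), consumer(queue, out))
    return out

print(asyncio.run(main()))  # [0, 1, 2]
```

The queue is what removes the dependency between the two sides: either can be scaled or restarted independently, which is the scalability property noted above.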
The decoupling and asynchronous nature of an EDA increases scalability by
removing the dependencies between event generators and event consumers. These
properties result in the methodology being well adapted to large-scale distributed
environments [14].
EDA Capability
EDA has been researched for its ability to secure competitive advantage in the
healthcare industry [8]. The research reviews the relationship between EDA capabilities, dynamic capability and competitive advantage in healthcare organisations.
Capabilities are organisational capabilities and information technology (IT) capabilities
[1]. Organisational capability is defined as the competency to coordinate the organisation's resources effectively to achieve corporate performance. IT capability is defined
as the ability to mobilise and deploy IT resources in combination with organisational
resources and capabilities. Organisations with high IT capability tend to outperform
competitors on a number of profit and cost based performance measures [1]. The
research defines EDA capability as the “ability to propagate the real-time events to all
interested targets automatically and to support them to evaluate and then make decisions optionally” [8]. The EDA capability is sub-categorised into capabilities reflective
of EDA design principles. Table 1 describes each of the capabilities along with the
correlated design principle. Design principles are associated with capabilities;
for example, by adopting an EDA, asynchronous communication is realised, resulting in
the sensing capability of retrieving information about real-time processes [8].
Table 1. EDA capability dimensions [8]. Each capability dimension correlates with a design
principle (e.g., the publish-subscribe pattern):
• The ability of the event-driven IT architecture to recognize event-triggering information
and to provide managerial visibility into the business processes in real time
• The ability of the IT architecture to support the manager's decisions and actions at each
organizational level
• A system platform which performs the exchanging and sharing of data on two or more
software components
• Enabling an organization to speed operational changes and promote a high degree of
business agility via IT architecture modularity and compatibility
Adoption Factors for EDA
The literature has established that technical, business and human factors are relevant in
the adoption of Service Oriented Architecture (SOA) [15]. Many of these factors are
also valid in the context of EDA. Thus, we reviewed SOA adoption factors [16]
detailed in Table 2.
Table 2. SOA adoption factors [16]
• Strategy and goals: a clear overall strategy and goals are required in the initiation phase.
The strategy includes business and IT to meet critical business goals related to SOA adoption
• Financial implications and benefits: the financial implications and benefits of adopting an
SOA concern revenue streams, costs and return on investment. SOA may provide increased
revenue or decreased expenses through the increased reliability it offers, and the time to
market new services for customers is reduced due to its reusability features. Costs such as
new hardware, software and human resources need to be established to fully assess the
potential factors
• IT agility-business alignment: aligning business processes to an SOA approach requires
effort from both IT and business
• Communication: effective communication between the different departments of the bank
and its vendors is essential for SOA adoption success
• Culture: a culture that facilitates SOA is vital to its adoption
• Human resources: training current staff and introducing staff competent in SOA is
necessary for SOA adoption
• Risk management: in introducing a new methodology, risk is an inherent factor. Risk
management and monitoring should be ongoing throughout the adoption
Organisations adopting an EDA need to understand the requirements pertaining to
the adoption before proceeding. Adoption factors of event-driven SOA include the
ability to have an established and reliable hosting environment to facilitate the communication of events across multiple applications in parallel [17]. Without this, a
bottleneck may occur, resulting in delayed event information and causing
performance impediments [17].
Another adoption factor is to have an ordering mechanism for events. This is to
ensure that dependent information is delivered in the correct order for critical business
processes [9, 17]. EDA adoption is adversely affected by the non-standardization of
event messages; adhering to a particular event standard is a key requirement for the
successful implementation of an EDA [9]. Currently, event messages are developed
according to particular vendors' formats as opposed to an EDA standard [17].
Other factors required for the successful adoption of an EDA concern the distributed
environment. The recovery from failures in applications and in the hosting environment
needs to be taken into account following the adoption of an EDA [10, 17]. Failures
such as exceptions in event flows are also difficult to diagnose, and a proper strategy
and monitoring tools are needed for an EDA [18].
The technical aspects related to successful adoption of an EDA or any other
methodology are compounded by the factors related to human resources and skills [5].
Having the appropriate skill levels and the human resources capable of understanding
an EDA is paramount to its adoption. This is particularly important in CEP applications
that have several events that are required to complete an action [16].
Micro Lending
Access to credit is key to economic development and improving the standard of living
for many people [19] and during times of crises is a fundamental aspect of recovery
[20]. Traditional financial institutions require individuals to provide collateral in
obtaining a loan or credit. This is normally out of reach for the poor [21]. Hence micro
lending, for many years, has been a predominant credit lending facility in the informal
sector. In order to accommodate the need for credit, a digital approach was taken and
online short-term loans were established as the preferred means of credit lending.
Micro lending in the form of short-term loans is a relatively new market segment.
A short-term loan is a monetary credit with a term not exceeding a single month [22].
Having access to credit is important in times when natural disasters such as floods,
hurricanes and tornadoes occur. In the United States those affected by natural disasters
and not compensated timeously by insurance companies have utilised the services of
short-term loan providers to obtain much needed financial assistance [20].
In recent years short term loan providers have gained a reputation for misdirecting
credit seeking consumers and charging higher than normal interest rates and fees. This
resulted in an increase in legislation and regulation by governments and credit lending
bodies [22]. Short-term loan providers are required to mitigate risk and to verify
customers that are applying for these loans. To enable this verification, a number of
information systems both internal and external to the organisation are queried for
different criteria. This includes the verification against legislation requirements and
validation of personal information and income. The querying of systems occurs in real
time, and a lending decision is obtained in a relatively short period [22]. The criteria
given above and the adoption factors presented advocate the use of an EDA in a
micro lending organisation.
3 Research Framework and Methodology
The research asked the question of what factors are required for the adoption of an
EDA within a micro lending organisation. The sub-questions include:
• What technical and environmental factors did the adoption address?
• What concerns were there in the adoption of an EDA?
• How did the adoption affect the organisation and the stakeholders?
Determining the adoption of information systems is a common research area. There
are a number of frameworks, models and methodologies that aid in determining how
specific variables affect the adoption of a specific technology within an organisation
[23]. To investigate the factors influencing EDA adoption, the research used the TOE
framework [24]. The framework identifies three contexts, namely the technology,
environment and organisation contexts. TOE is regarded as a framework that provides
a classification of variables rather than a fully integrated conceptual framework. To enrich the TOE framework, variables from other areas, such as sociological, cognitive and technology-readiness factors, need to be incorporated. This is usually done by combining
the TOE framework with other theories such as Diffusion of Innovation [23]. The
research used these contexts in order to investigate factors retrieved from the literature.
K. Sookoo et al.
The research was conducted in a post-positivist manner. Post-positivism is a paradigm that moves beyond the narrow perspective of positivism, enabling the examination of real-world problems; it recognises that diversity and complexity are part of all experiences [25]. The research
follows the deductive approach [25], using the TOE framework in conjunction with
factors derived from literature to investigate the adoption factors of EDA.
The research takes the form of a case study, investigating a single case. The theoretical motivation is the exploratory nature of the
research. A practical reason for the particular case study selected is that one of the
researchers is employed by the organisation and has access to stakeholders within the
organisation to perform interviews and to obtain valuable information. Permission to conduct the research was obtained from the micro lending organisation's management, and ethics approval was obtained from the researchers' university.
A cross-sectional approach was taken, i.e. a snapshot of the status in the organisation in August 2015. However, some of the historical evolution and experiences were
included in the case study description and interviews.
Fifteen candidates were interviewed in semi-structured interviews. These included
software engineers and a team lead who actively maintains and develops the EDA
solution, a software development manager, a product owner responsible for new product features, the head of operations, the head of marketing, and executive management. The international micro lending organisation has a geographically dispersed
team and as a result some candidates were interviewed in person and others were
interviewed with the use of video or voice calling. The questions varied depending on
the interviewee’s role in the organisation and knowledge of the EDA system.
Interview responses were analysed using thematic analysis, supported by NVivo
software. During phase one, the researcher read the results and transcripts to become
familiar with the content in order to identify patterns. During phase two codes were
identified and documented. During phase three codes were linked to themes. Phase four
determined how the themes support the results obtained and the factors presented by
the TOE framework. If these themes were incomplete, the process began from phase
one until all themes were identified. Phase five determined the significance of each
theme and the significance of the results obtained.
4 Case Study Description
Organisational Context
The case study was performed at an MLO which is based in South Africa but is part of an
international organisation. The MLO is a pioneer in the micro lending space by being
the first to introduce short-term loans online. The MLO was started in 2007 and has its
headquarters in a major European city. Two South Africans, one a software engineer
and the other an entrepreneur, founded the MLO. The founders wanted to develop a
lending model that provided short-term credit in a transparent manner and with
immediate access to funds. Furthermore, the solution should provide customers with a
paperless and prompt response on the credit decision.
Implementing an Event-Driven Enterprise Information Systems
In order to provide for the prompt decision-making, the organisation developed a
fully automated decision processing technology to determine the customer’s risk
profile. The risk model and algorithms use data points that access a number of online
credit providers using application-programming interfaces (API). Once a positive
decision for lending is provided by the decision engine, the funds are transferred to the customer's bank account.
The improvement of the decision engine and algorithms led to the expansion of the
MLO to several countries. This led to a number of legislative and regulatory challenges, as each country has its own credit regulations. In the country of origin, the MLO was a disruptor and, at the time, there were no adequate regulations or legislation on short-term lending. The South African branch of the organisation, however, was subject to a lending environment with stringent regulations and legislation. The National Credit Act (NCA), overseen by the National Credit Regulator (NCR), governs these regulations [19].
EDA System Architecture
The architecture of the MLO is based upon the need to move from a monolithic
traditional system to a distributed system. The concepts around this distributed system
include the use of a service bus encompassing a messaging paradigm. A service bus is
an architecture pattern that facilitates the messaging between multiple services and
applications. Communication between services and applications occurs at a lower level and provides a means to exchange predominantly XML-based messages [9]. The service bus utilised by the MLO is based upon NServiceBus, a distributed messaging framework implemented on Microsoft's .NET Framework.
To realise this architecture, a number of design patterns were used. The publish-subscribe pattern provides for the implementation of event generators and event consumers. Another pattern featured in the MLO's design of the distributed system is the Command Query Responsibility Segregation (CQRS) pattern. This pattern separates the reading of data from changing the state of the data in a distributed system. The benefit is the ability to distinguish between the actions that alter the
system and those that require information from the system [26].
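A minimal in-process sketch of these two patterns may help (the `EventBus`, `LoanWriteModel` and `LoanReadModel` names are illustrative inventions; a production system such as the MLO's would route these messages over a queue via a framework like NServiceBus rather than an in-memory dictionary):

```python
from collections import defaultdict

class EventBus:
    """Minimal publish-subscribe mechanism."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every registered consumer.
        for handler in self._subscribers[event_type]:
            handler(payload)

# --- Command (write) side: alters state and publishes an event ---
class LoanWriteModel:
    def __init__(self, bus):
        self.bus = bus
        self.loans = {}

    def approve_loan(self, loan_id, amount):  # a command
        self.loans[loan_id] = amount
        self.bus.publish("LoanApproved", {"loan_id": loan_id, "amount": amount})

# --- Query (read) side: a projection kept up to date from events ---
class LoanReadModel:
    def __init__(self, bus):
        self.approved = {}
        bus.subscribe("LoanApproved", self.on_loan_approved)

    def on_loan_approved(self, event):  # an event consumer
        self.approved[event["loan_id"]] = event["amount"]

    def get_amount(self, loan_id):  # a query; never touches the write model
        return self.approved.get(loan_id)
```

The write model never answers queries and the read model never changes state, which is exactly the separation CQRS prescribes; the event bus is what decouples the two.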
The user interface interacts with the services of the system via the service bus using XML. The EDA solution of the MLO comprises four service contexts. The payments service is responsible for the processing and collection of customer payments. The decision service is responsible for the verification of customer details and for assessing the affordability of the customer. This service interacts with credit providers and the decision engine to obtain the data points of the customer. The communication
service is responsible for the communication channels to the customer. This integrates
with an external marketing platform that enables communication using email, SMS and
social media.
The technology utilised forms part of the current architecture, which was developed in 2011. Because the initial monolithic system was developed using Microsoft technologies, the decision was taken to continue with the existing technology stack, as the team's skills centred on Microsoft technologies.
The distributed enterprise software framework NServiceBus builds on Microsoft's .NET Framework [27] and uses a point-to-point service configuration, where each service connects to a service address or endpoint to publish or subscribe to messages. NServiceBus works in conjunction with a messaging system. The messaging system utilised within the MLO is Microsoft Message Queuing (MSMQ). MSMQ is found on almost every version of the Windows operating system and allows the communication of messages between heterogeneous networks and applications [27]. It uses an underlying data store such as Microsoft SQL Server to perform event sourcing.
A major framework feature is the ability of NServiceBus to manage long-running processes known as sagas. These processes are scheduled to execute for an extended duration of time and normally involve the use of batch jobs. Sagas save the state of
event messages thereby enabling the framework to provide the capabilities of fault
tolerance and automatic retry [27].
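The retry behaviour described here can be sketched as follows. This is an illustrative Python approximation, not the NServiceBus saga API: in the real framework, a failed message is returned to the queue and redelivered (eventually landing in an error queue), rather than retried in a loop, and saga state is persisted to a data store.

```python
# Illustrative sketch of what a saga does: persist state between
# event messages and automatically retry failed handlers.

class PaymentSaga:
    def __init__(self):
        # state that would be persisted between messages
        self.state = {"collected": False}

    def handle(self, message, handler, max_retries=3):
        # Retry the handler on transient failure; a real framework would
        # redeliver the message from the queue instead of looping here.
        for attempt in range(max_retries):
            try:
                handler(message)
                return True
            except RuntimeError:
                continue
        return False  # message would be moved to an error queue

saga = PaymentSaga()
attempts = {"n": 0}

def collect_payment(msg):
    attempts["n"] += 1
    if attempts["n"] < 2:  # simulate a transient bank-endpoint failure
        raise RuntimeError("bank endpoint unavailable")
    saga.state["collected"] = True
```

The point of the sketch is the business benefit noted in the interviews: a transient failure does not lose the message, and the process resumes from saved state without manual intervention.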
The MLO makes use of cloud services such as Amazon Web Services for the hosting of server environments. These cloud services help alleviate the difficulty of providing a distributed team with access to resources in multiple countries. A further technology, separate from the EDA solution, is the user interface layer, which is based on a PHP technology stack.
5 Adoption Factors
The discussion of the hypothesised adoption factors is organised using the main categories of the TOE framework. The interview analysis and discussion have been summarised to fit within the space requirements of the paper; a more detailed analysis is available on request from the authors. Positive and negative influences are indicated by (+) and (−) respectively.
Technology Context
Relative Advantage (+): Supported - This is illustrated by the MLO’s transition from
a monolithic system. The factors of scalability, asynchronous communication and
decoupling were cited as the most important in the adoption. “Typically when someone
decides to use asynchronous technologies they’re expecting scale…which is a very
common in a start-up." This both complements and distinguishes EDA from SOA: "Generally
what we’re trying to achieve is a service oriented architecture. There’s three ways you
can achieve that one is through designing it in an Asynchronous event driven model
and the other is through a typical web service model. So they both give you a service
oriented architecture but in different ways…" Other technical participants confirmed this, indicating that scalability, decoupling and asynchronous communication are crucial benefits in comparison with traditional monolithic systems and systems that use synchronous communication.
Another factor favouring an EDA over an SOA relates to the reduction in temporal coupling. A concern in the context of the MLO, which integrates with third-party services such as credit providers, is that when temporal coupling occurs it will affect customers: "Another big problem that can be exacerbated even more if you're
doing a request out to the web you know from an internal system out to a third party”.
To solve this issue and move away from a purely service-oriented approach, the MLO chose an EDA that removes the temporal coupling through the publish-subscribe pattern: "What event driven architecture try to do they try to solve
that problem by reducing the necessity to have those synchronous request response…
service A can just service B can publish an event when the data changes and service A
can listen to that event and store a local copy. Now when service A needs that
information, service A no longer has that temporal coupling”.
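The mechanism the participant describes, publishing change events so consumers keep a local copy and need no synchronous call at read time, can be sketched as follows (the service names and the `credit_score` field are illustrative, not the MLO's actual data):

```python
# Sketch of removing temporal coupling: service B publishes change
# events; service A caches the data locally and never makes a
# synchronous request to B when it needs the information.

class ServiceB:
    def __init__(self):
        self.subscribers = []
        self.data = {}

    def update(self, key, value):
        self.data[key] = value
        for notify in self.subscribers:
            notify({"key": key, "value": value})  # publish change event

class ServiceA:
    def __init__(self, service_b):
        self.local_copy = {}
        service_b.subscribers.append(self.on_change)

    def on_change(self, event):
        # keep a local copy of B's data as events arrive
        self.local_copy[event["key"]] = event["value"]

    def lookup(self, key):
        # served entirely from the local copy: if B is down or slow,
        # A can still answer, so there is no temporal coupling
        return self.local_copy.get(key)
```

The trade-off, consistent with the complexity findings below, is that service A's copy is eventually consistent rather than guaranteed current.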
Perceived Direct Benefits (+): Supported - The benefits of the adoption of the EDA
are related to the fault-tolerance and reliability of the EDA solution. This is necessary for
the integration with external systems. Participants from the organisations’ business units
indicated that reliability was a major factor in the adoption of the distributed EDA
solution. “You know I think that an event driven architecture you have a little more long
term reliability because you can maintain things in smaller chunks…" Participants also pointed to the fault-tolerance capability of an EDA, noting that event messages are reprocessed when recovering from failure. This brings the benefit that critical customer-related business processes are re-executed when recovering from a faulted state, with little or no intervention. "You know then the comfort in knowing that messages and things like that
will be resubmitted when you come back online… that is reassuring….and a lot of things
kind of resolves itself which is a big benefit from a business perspective…." A software engineer's statement that "the pro to being durability of the system and things being able to really being able to go down and not ever lose any progress" confirms this.
Complexity (−): Supported – Asynchronous communication increases complexity for
software engineers. A consensus from the data collected indicated that there is an
increase in the complexity of the solution. Moving from a monolithic system to a
distributed system added to the complexity. “I wouldn’t say it’s because of event driven
architecture I would say it’s more to do with a distributed service orientated architecture and the fact that you know all these distributed components around the place.”
This affects developer productivity: “Complexity to this system makes the learning
curve for new developers a lot higher. By this I mean it can make the debugging a lot
more tricky, harder to see…” A notable issue experienced by the MLO in retrieving
information is a disconnection between all three systems. However, a notable comment
is that the “adoption of an architecture does not necessarily make a business more
complex. I think your architecture should reflect how a business is organised”.
The complexity introduced is offset by reliability, agility and fault tolerance.
Perceived Financial Costs (−): Supported – Costs increase for additional hardware and resources. An increase in operational costs is experienced in the maintenance of the EDA solution. Cost and complexity are related in that, due to system complexities, more skilled developers are required. "… The cost goes hand in hand to the complexity because obviously it takes longer…more development hours you need
a lot more skills developers in order to function properly on such a system." Operational costs increase because the EDA solution requires more hardware to achieve its
objectives. These objectives include flexibility and reliability. “From an operational
perspective it is more expensive to run the EDA solution because it runs across more
machines and more hardware. Having said that, it does mean that it’s more reliable.
You cannot get more reliability and more flexibility at a cheaper cost…”
Infrastructure (+): Supported – More personnel and hardware are required for the
adoption. “There’s no specialised hardware required but there are more of it because
of the fact that there are more services spread across more systems that way…” In
terms of the infrastructure to support and maintain the EDA solution, additional support staff were required: "So yes so you would need more sys ops and maybe DevOps people
as well. So yeah in short…you do need more people mostly around specialisations of
DevOps with the environments.”
Technology Maturity (+): Unsupported – Frameworks and tools for an EDA adoption are still improving and are not mature due to a lack of standards. When the
adoption of the EDA solution commenced, not many tools addressed the need for event
messages in a distributed system. As stated above, the solution was based on Microsoft's technology stack. Choosing NServiceBus provided continuity from the monolithic system by keeping with Microsoft technologies. "I think at the
time there wasn’t a hell of a lot frameworks available. Um…I think the only really…
other framework that was really considered is called Mass transit." However, since then, new open-source technologies have become available and the current technology selection has greatly improved. "The tooling I think is quite good you know there's lot
of open source tools that enable you know co-ordination across services and you know
tools like Nservicebus you know they are quite solid and you know their equivalence
and the units were older were quite good. So tooling I think have become quite good
over the past 5 years.”
Vendors and Tools (+): Partially supported – Tools are categorised around development, monitoring and business intelligence. As stated previously, the predominant technologies used are Microsoft-based. However, there is a lack of tooling for advanced usage scenarios. "The tooling in IDE needs to you know evolve to the point where this
is a standard sort of feature where they you know tie these handlers and things together
and allow you to navigate from you know a loosely coupled one point to the other.”
Tooling for race conditions is not yet established: “You know if you’re trying to figure
out what is the exact sequence of events that caused request to fail. If it’s due to race
conditions for example those can be extremely difficult to solve and to tooling around
solving race conditions is still poor.”
Organisational Context
Human Resource Factors (+): Supported – Adequate personnel with a high level of experience are required when adopting an EDA. "From a hiring perspective you know
we do need people that are a little bit more capable and you know have stronger skills
than say a simpler system.” However, this brings advantages to the MLO as the
personnel hired have a greater skillset and quality: “I don’t think that’s necessarily a
bad thing for the other advantages that brings you know in terms of just having higher
quality people but it does mean it requires a different skillset.” In terms of a specific
skillset required by software engineers, a number of participants advocated for the
ability and willingness to learn: “In general developer skills would mean willing to
easily learn or ability to easily learn and quickly adopt those are very general ability
skills and – or understanding. Generally these systems are sort of messaging based so
having some knowledge or done some research on how messaging based systems work."
Innovation Capacity (+): Supported – In the context of the MLO, innovation is hampered by a lack of integration. However, an EDA has a capacity for innovation.
“The system definitely is something that scales very nicely for new features. There’s no
limitations to how much you can add to the system…which is good…” The decoupling
of the services provides for agility in adding new features: “Yeah I think it kind of it’s
event driven architecture but it’s also it also relates to your system services. So when
you have these decoupled systems and you have the ability to throw in any service and
listen to events. It will naturally make your system more agile." Adding new features and integrating with third-party components is much simpler through an event-driven approach: "Event driven architecture tends to make using third
parties far simpler and easier.”
Knowledge Capability (+): Partially supported – In the context of the MLO, obtaining knowledge is a challenge due to the difficulty of retrieving information. Software
engineers that are newcomers to the distributed architecture and in particular to an EDA
solution have to quickly adapt and learn. The MLO has a number of resources available
for software engineers to learn the skills required to develop and maintain the EDA
solution. “Distributed systems experience generally… I think it’s quite poor and the
candidates I see and so generally I’m trying to look for people who have more the
capability as opposed to the existing knowledge and you know there’s a lot more on the
job learning." A central resource is the MLO wiki, which contains information from all the regions on specific details of the system. Consulting it is a mandatory requirement when introducing software engineers to knowledge of distributed systems. "At the moment
just feel we relying on so many little bits and pieces to help us with our one core
objective. It’s very fragmented put it that way.”
Operational Capability (+): Supported – A generalisation of this capability is that the
decoupling of an EDA helps in achieving an operational capability by modularisation.
The monolithic system prior to the adoption of the EDA was difficult to maintain and to develop new features for. This is stated by the software development manager (SDM), contrasting the two systems: "More
monolithic architecture which was based a lot on store procedures and triggers to
cause work to occur and because of that it was an extremely difficult system to maintain
and to develop on and everything was one big code blob and it you know created lots
of problems in terms of having lots of people work on it.” This is in contrast to the EDA
solution: “…so moving to the EDA solution was a huge improvement in terms of you
know having things broken down into smaller pieces and therefore you know little bit
more understanding”.
Top Management Support (+): Supported - Top management refers to managers who make strategic decisions and commit resources to the completion of projects towards a strategic goal. Support from management in adopting an EDA is paramount. Although the management team understand there are complexities, very few understand the extent of those complexities. "They don't all understand the complexities I
think they all understand that there are levels of complexities… Uh….varies from
person to person some more technical than others and some just really shouldn’t be
worrying about some areas of technicality.” Management has concerns over certain
areas. The main issue is not with the EDA solution but the process and resources in
accomplishing a change to the system. “Management is constantly frustrated by the
length of time it takes to make changes. I think that one of the biggest challenges there
is the lack of continuity…”
Environmental Context
Legislation Factors (–): Partially supported – Micro lending in the digital marketplace is a relatively new industry. Legislation and regulations play an important part in shaping the micro lending industry. The MLO, due to its geographical dispersion, has to contend with multiple jurisdictions. Legislation factors do not have a direct influence
on the adoption: “I don’t think regulation in and of itself is going to be making a
difference to whether one uses an event driven architecture or not". However, an EDA helps in achieving flexibility in the context of the MLO, so from a regulatory perspective there is an advantage to a service-oriented approach as opposed to a monolithic approach: "…so from so from a regulatory stand point – I guess there is an
advantage to service-orientedness as opposed to monolithic publicationess…” since
“agility [is] the ability to uh…react timeously [to legislative changes] without having
to touch every part of the system.”
Competitive Pressures (–): Partially supported – An organisation's ability to adapt to competitive pressures is reliant on the underlying information systems. Although the
architecture itself is not a competitive advantage, it does provide the capabilities to
deliver features and create new products. However, these information systems can be
implemented with different architectures. “…so it’s not the architecture itself that I
think would give the competitor an advantage it’s how you use that architecture.” A
competitive pressure experienced by the MLO is that competitors utilise its service to determine how to improve upon the MLO's customer experience. "There's naturally
the ability for your competitors to swoop in and kind of look at your journey and where
things fall flat and be able to add that value to the customer…” Therefore, the competitive pressure of continuously adapting and being agile to prevent customer loss is
reliant upon the underlying applications that in turn are reliant on the architecture.
Customer Mandate (+): Unsupported – Customer factors do not influence the adoption; however, the capabilities offered provide an advantage. Information and the interaction
of customers provide a means of driving innovation: “Our business is enabled through
technology. In other words, we got a customer that goes onto a website and is able to
do a bunch of things in order to interact with us.” A more general discussion not
related to the adoption of an EDA but to customers of the MLO is that customers form
part of the core of the MLO and if these customers are not adequately serviced, they
will move to a competitor: "…I think there's three core effects by not delivering to the
customer. Essentially I mean that there’s a reputation …there’s naturally the ability for
your competitors to swoop in and kind of look at your journey and where things fall flat
third level there’s obviously a sustainability the more customers you start losing and
you know that affects the reputation and the customers defecting to competitors." Thus,
this does not necessarily influence the adoption of an EDA but the capabilities offered
by an EDA provide an advantage.
Correlations Between Factors
During the analysis, we found that many of the factors influencing the adoption of EDA correlate with each other. The analysis showed correlations between factors such as complexity and cost, innovation and human resources, and vendors and tools and operational capability. By taking advantage of EDA capabilities, distributed systems become reliable and agile in accomplishing the objectives of an organisation.
Future research could investigate these correlations in more detail.
6 Limitations
The research conducted a case study within a single organisation. As such, the research
provides a single example of the adoption of an EDA. Since very little research has
been carried out on the adoption of EDA, correlating the results with other research is
difficult. The micro lending case study was carried out within the financial sector, and thus there is little knowledge of how the adoption of an EDA relates to other domains, although, if the capabilities offered by an EDA provide an advantage, this could translate to other domains.
7 Conclusion
EDA provides an organisation’s enterprise systems with many capabilities, including
the publish-subscribe pattern, asynchronous communication and decoupling. These
enable organisations to make real time decisions in dispersed information systems.
Decoupling achieved by employing an EDA improves scalability allowing for
large-scale distributed environments. Capabilities offered by EDA allow organisations
to manage business processes more flexibly by allowing real time access. EDA enables
IT architecture in organisations to respond and take action across organisational
levels. This is accomplished by allowing interoperability across information systems
and as a result enables flexibility in the organisation. Micro lenders of short-term loans
are well suited to the adoption of an EDA. These micro lenders require an information
system that can process events in real time. This real time processing must provide
verifications and decisions on customers taking out loans.
The analysis of the case study affirms the capabilities that were highlighted in the
literature review. These capabilities play an important part in enabling the MLO to be
adaptable, flexible and reliable. The factor of relative advantage highlighted that the
technical abilities of an EDA namely scalability, asynchronous communication and
decoupling are pre-requisites for realising these capabilities. Further to this, an
advantage over SOA is that EDA limits temporal coupling. The benefits of these capabilities allow the MLO, and other organisations adopting an EDA, to have a reliable and fault-tolerant system that can help in achieving their objectives.
However, these capabilities do come at a cost. The cost relates not only to the financial implications of a distributed system but also to the additional complexity introduced. This complexity is associated with increases in infrastructure, resources and tools.
Furthermore, the skills and experience levels of staff are of great importance. From a software engineering perspective, skills in asynchronous communication are paramount to understanding the system. These personnel are drivers of innovation. A concern raised is that a lack of experienced technical resources hampers innovation. In the context of the MLO, the generation of information from the EDA to obtain knowledge is complicated by the complexity of the data structure of the system. Managing this structure requires greater skills and resources. Hence this research advocates for more distributed-systems knowledge to be introduced at tertiary level, and more is needed to educate entry-level software engineers about distributed systems.
The fit of the technology, namely EDA, to software engineers and business personnel highlighted a few possible areas of improvement for the MLO. A recurring
concern is around information extraction and the centralisation of information. This has
added implications for the relationship, quality and compatibility of tasks and processes with the EDA solution.
It is difficult to extrapolate to what extent our findings are generalizable to other
organisations. The case study description given should allow one to draw parallels
between organisational contexts and, to the extent that contexts are similar, we would
think that at least some of the factors will apply as well. Generalising across a wider
sample of organisations, especially in the service industry, would be a recommended
avenue for future quantitative research.
This research should be of interest to information systems researchers as well as to practitioners and businesses looking for a real-world application of EDA and the important adoption issues and factors experienced in the case study.
References
1. Chen, Y., Wang, Y., Nevo, S., Jin, J., Wang, L., Chow, W.S.: IT capability and
organizational performance: the roles of business process agility and environmental factors.
Eur. J. Inf. Syst. 23(3), 326–342 (2013)
2. Magoutas, B., Riemer, D., Apostolou, D., Ma, J., Mentzas, G., Stojanovic, N.: An
event-driven system for business awareness management in the logistics domain. In: Rosa,
M., Soffer, P. (eds.) BPM 2012 Workshops. LNBIP, vol. 132, pp. 402–413. Springer,
Heidelberg (2013). doi:10.1007/978-3-642-36285-9_43
3. Juric, M.B.: WSDL and BPEL extensions for event driven architecture. Inf. Softw. Technol.
52(10), 1023–1043 (2010)
4. Clark, T., Barn, B.S.: A common basis for modelling service-oriented and event-driven
architecture. In: Proceedings of the Fifth India Software Engineering Conference, pp. 23–32
5. Krumeich, J., Weis, B., Werth, D., Loos, P.: Event-driven business process management:
where are we now? – a comprehensive synthesis and analysis of literature. Bus. Process
Manag. J. 20(4), 615–633 (2014)
6. Vezzani, R., Cucchiara, R.: Event driven software architecture for multi-camera and
distributed surveillance research systems. In: 2010 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition - Workshops, CVPRW 2010, pp. 1–8 (2010)
7. Raffelsieper, T., Becker, J., Matzner, M., Janiesch, C.: Requirements for a pattern language
for event-driven business activity monitoring. In: European Research Center for Information
Systems (ERCIS), pp. 1–12 (2012)
8. Ave, W.M., Kung, L., Byrd, T.A.: Leveraging event-driven IT architecture capability for
competitive advantage in healthcare industry: a mediated model. In: Proceeding of Thirty
Fourth International Conference on Information Systems, pp. 1–12 (2013)
9. Luckham, D.: Event Processing for Business: Organizing the Real-Time Enterprise. Wiley,
New York (2011)
10. Nitz, S., Kleiner, C., Koschel, A., Astrova, I.: Applying event-driven architecture to mobile
computing. In: IEEE International Symposium on Signal Processing and Information
Technology, pp. 58–63 (2013)
11. Fan, M., Fan, H., Chen, N., Chen, Z., Du, W.: Active on-demand service method based on
event-driven architecture for geospatial data retrieval. Comput. Geosci. 56(1), 1–11 (2013)
12. Esposito, C., Cotroneo, D., Russo, S.: On reliability in publish/subscribe services. Comput.
Netw. 57(5), 1318–1343 (2013)
13. Barthe-Delanoë, A.-M., Truptil, S., Bénaben, F., Pingaud, H.: Event-driven agility of
interoperability during the run-time of collaborative processes. Decis. Support Syst. 59(1),
171–179 (2014)
14. Zagarese, Q., Canfora, G., Zimeo, E., Alshabani, I., Pellegrino, L., Alshabani, A., Baude, F.:
Improving data-intensive EDA performance with annotation-driven laziness. Sci. Comput.
Program. 97(1), 266–279 (2014)
15. MacLennan, E., Van Belle, J.P.: Factors affecting the organizational adoption of
service-oriented architecture (SOA). Inf. Syst. e-Bus. Manag. 12(1), 71–100 (2014)
16. Basias, N., Themistocleous, M., Morabito, V.: SOA adoption in e-banking. J. Enterp. Inf.
Manag. 26(6), 719–739 (2013)
17. Ghalsasi, S.Y.: Critical success factors for event driven service oriented architecture. In:
Proceedings of the 2nd International Conference on Interaction Sciences: Information
Technology, Culture and Human, pp. 1441–1446. ACM (2009)
18. Clark, T., Barn, B.S.: Event driven architecture modelling and simulation. In: IEEE 6th
International Symposium on Service Oriented System (SOSE), pp. 43–54 (2009)
19. Mashigo, P.: The lending practices of township micro-lenders and their impact on the
low-income households in South Africa: a case study for Mamelodi township. New Contree
65(1), 23–46 (2012)
20. Morse, A.: Payday lenders: heroes or villains? J. Financ. Econ. 102(1), 28–44 (2010)
21. Nkpoyen, F., Bassey, G.E.: Micro - lending as an empowerment strategy for poverty
alleviation among women in Yala local government area of cross river state, Nigeria. Int.
J. Bus. Soc. Sci. 3(18), 1–9 (2012)
22. Bhutta, N.: Payday loans and consumer financial health. J. Banking Finance 47(1), 230–242
K. Sookoo et al.
23. Gangwar, H., Date, H., Raoot, A.D.: Review on IT adoption: insights from recent
technologies. J. Enterp. Inf. Manag. 27(4), 488–502 (2014)
24. Tornatzky, L., Fleischer, M.: The Process of Technology Innovation. Lexington Books,
Lexington (1990)
25. Saunders, M., Lewis, P., Thornhill, A.: Research Methods for Business Students, 5th edn.
Prentice Hall, Harlow (2009)
26. Betts, D., Domínguez, J., Melnik, G., Simonazzi, F., Subramanian, M.: Exploring CQRS and
Event Sourcing. Microsoft, Redmond (2012)
27. Boike, D.: Learning NServiceBus. Packt Publishing Ltd, Mumbai (2015)
Software Innovation Dynamics in CMSs
and Its Impact on Enterprise Information
Systems Development
Andrzej M.J. Skulimowski1,2 and Inez Badecka1,2
1 Chair of Automatic Control and Biomedical Engineering,
Decision Science Laboratory, AGH University of Science and Technology,
Al. Mickiewicza 30, 30-050 Krakow, Poland
[email protected]
2 International Centre for Decision Sciences and Forecasting,
Progress & Business Foundation, Lea 12B, 30-048 Kraków, Poland
Abstract. This paper reports the results of a prospective study on information
system development trends. It is based on the observation that SMEs seek
opportunities to endow their CMS-based IS with CRM, e-commerce and
ERP/ERM functionalities. Using publicly-available release data, we investigated
the dynamics of subsequent versions and new functionalities of the most popular
open-source CMSs - Drupal, Joomla!, and WordPress. Special attention was
paid to software innovations that make it possible to use CMS-based applications for typical EIS purposes. The software technology race was modelled by
a system of quasi-linear stochastic equations with state variables describing the
upgrade generation time. Two such models have been built and compared for
the above CMSs. Trend extrapolation with vector autoregression allowed us to
predict ERP-related functionality development prospects until 2025. We maintain that the deployment of CMS-based ERP/ERM may have a considerable impact
on business models and strategic ICT alignment in SMEs.
Keywords: Software evolution · ICT foresight · ERP-CMS · Autoregression · EIS scenarios · Open-source software
1 Introduction
The deployment order and intensity for different information and communication
technologies (ICTs) in enterprises is an important issue from the software market point
of view. It is also a research question relevant to understanding the technological
evolution of Enterprise Information Systems (EISs) and their market development
trends. ICT deployment trajectories in large enterprises have been extensively studied
by many researchers, cf. e.g. [3, 8, 9, 14], who have often referred to them as strategic
ICT alignment [4]. However, the ICT investment behaviour of small and medium-sized
enterprises (SMEs) is known in far less detail, due to the great diversity of SME
structures and business models and the greater flexibility of SME managerial decisions,
which often depend on individual preferences. Nevertheless, SMEs are expected to be the
fastest growing market for corporate ICT solutions over the next decade [16]. Software
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 309–324, 2016.
DOI: 10.1007/978-3-319-49944-4_23
market research indicates that the most important ICT functionalities for SMEs are
those offered by Web 2.0 CMS technologies. The research presented in [18] provided
clues that enterprise resource planning (ERP) systems based on the collaboration of
developer groups, supported by web browsers and typical CMSs, could become a
significant future trend among SMEs. This conjecture has been confirmed by a needs
analysis performed among enterprises during the initial phase of a foresight project
[16]. In contrast to big companies, which use large and complex multi-module ERP
applications such as SAP, SMEs widely use in-company online solutions [6] based on
open-source CMSs, for cost reasons. Further studies conducted within the above
project [16] showed that small and micro enterprises in particular intend to further
invest in the development of such systems. This trend can also be regarded as a
manifestation of the business model alignment to software evolution in the context of
syntactic integration of different enterprise software systems [15].
As a sample data set, this paper uses the CMS versions and the mutual relations of
ERP and CMS development trends (ERP-CMS) gathered in the project SCETIST [16].
Partial results concerning ERP-CMS have been presented in [17, 18]. A general
technology race model for the most common open-source CMSs Drupal, Joomla! and
WordPress was proposed in the form of a quasi-linear system of stochastic equations
describing software evolution trajectories. The state variables are the time lapses between
the generation of two consecutive software releases of each CMS under consideration. We will
construct, analyse and compare two specific models of upgrade generation processes
for the three CMSs. These models correspond to software development scenarios that
differ in the factors taken into account when deciding to produce a new release
that brings an essential software improvement.
Both models take into account the emergence of essential new functionalities in the
new releases. The first model refers to a situation where the innovation leader is fully
independent in shaping its development strategy, without taking into account the
progress made by its main competitors. The form of this model is justified by historical
data concerning the CMS innovation leader, Drupal, which indicate that the novelties
offered in new releases of WordPress and Joomla! have not influenced the main line of
Drupal development so far. The second model assumes a symmetric impact of innovation among all software providers.
The model coefficients are calibrated based on historical data published by the CMS
providers on their websites. However, the coefficients of the second model that describe
the future dependence of the current innovation leader on the other systems after it
loses its leadership could not be estimated, as such a situation has never occurred before.
Instead, these coefficients are assumed to be similar to the reactions of the other systems to
a competitor's advantages. They are then used in the calculation of future release-generation forecasts, together with the estimated coefficients describing the other CMSs. Thus, the
main goal of this paper is to apply the above foresight model composed of two
long-term software development scenarios to provide clues to SME managers as
regards the viability of the open source software they use as a seed information
technology [17] in their enterprises.
Trend extrapolation with vector autoregression has made it possible to construct two
scenarios of ERP-related functionality development prospects with open source CMSs
until 2025 and beyond, related to the above models. The technological trajectories thus
generated show surprising future behaviour, indicating changes in innovation leadership. According to these forecasts, Drupal, the current innovation leader, may lose its
market position if it does not take into account the innovation potential of its competitors. In the final section, we will discuss the related basic business scenarios of SMEs
as well as the synergy with other EIS development trends. We conclude that the
open-source CMSs are viable enough to capture a considerable portion of the ERP for
SMEs market over the next decade. Consequently, vendors and authorized resellers of
popular ERP systems may lose some of their revenue to CMS developers.
ERP based on CMS and web portal technologies (in this paper referred to as
ERP-CMS, cf. also [12, 23]) is an application that contains an administrative panel
endowed with a complex hierarchy of user authorisation levels. It can be used to
exchange information within the enterprise and externally, and manage its resources.
The ERP-CMS emerged as a result of including software development tools in advanced
CMSs, making possible the implementation of resource-management processes.
To conclude this section, in Box 1 below we recall a few basic notions related to
EISs used in further parts of this paper.
Box 1. Basic notions relevant to Enterprise Information System development
EIS: Enterprise Information System is a key notion used in this paper. EIS is a
broad class of business software facilitating information integration within an
enterprise and its exchange with external agents (clients, suppliers, business partners, authorities, etc.).
EAI: Enterprise Application Integration (EAI) is the process of sharing and
linking different information and business processes in an organization [4, 13]. As
a result of EAI, most ICT-supported business processes are entirely controlled by a
unified application. The same application ensures access to all or most enterprise
databases and an appropriate common data management system. The interconnection of all the organisational data forms a basis for implementing efficient
decision support systems [1].
ERP: Enterprise Resource Planning is an ICT-based integration of business
processes in an organisation. Usually, ERP is implemented as an information
system with a modular architecture, where each module manages a specific area of
activity. The same abbreviation, ERP, or 'ERP system', also refers to an integrated
software solution that implements the above process [3].
ERM: Enterprise Resource Management is a term which refers to ERP in a
more general setting and gradually replaces the ERP term. It reflects the extension
of enterprise application integration beyond resource planning. The notion of
resources is more general and covers any information managed by the company
and its intangible assets. One of the ultimate goals of ERM is to provide decision
support to enterprise management [3, 8, 13].
CMS: A Content Management System is a web application (or a set of applications) that facilitates the development of a website and other forms of
web-based information systems by non-technical content editors. CMSs
increasingly include modules and functionalities that support enterprise management [5, 15].
2 Basic Properties and a Review of Popular Open-Source CMSs
The main idea of a typical CMS is to allow a non-expert user to edit the content of
a website and modify its design via an easy-to-use user interface, called an administration panel. The main task of CMS platforms is to separate the information content of
the website from the technical aspects of its appearance. CMSs generate web pages
automatically or semi-automatically. The information entered by an authorised editor is
stored in a database. The CMS generates a dynamic website based on the content of
this database and on a selected template. This allows for more flexible and convenient
content management than with static HTML files. Thanks to this approach, the publishing of web pages has become much simpler than in the past.
2.1 A Review of the Most Popular CMSs
Popularity scores of CMSs are provided by various web application market watchers
[2], where the number of installations over time can be found, broken down by specific industries or sectors of use. The three systems selected for detailed
analysis in this paper cover most of the CMS market. They are briefly outlined below.
WordPress [21] has become the most popular content management system during the
current decade. It is written in PHP and uses a MySQL database. WordPress is distributed
under the GNU General Public License. As the successor of the b2/cafélog blog system,
WordPress is the most popular and user-friendly system for blogging. Among the many
CMSs, WordPress appears as the easiest to install, operate and configure. Due to its
growing range of functionalities and ease of use, WordPress is increasingly popular in the
business sector, as well as with entertainment and social networking sites. With its recent
enterprise-oriented premium spin-out, WordPress VIP, the platform fulfils the early predictions concerning the CMS development trends contained in [18], cf. also https://vip.
The Joomla! project was founded in August 2005 by the team who developed
Mambo, a predecessor of Joomla [11]. After the development of Mambo was abandoned, several successor systems appeared, such as Aliro (http://www.aliro.org),
Lanius CMS (http://www.laniuscms.org), Elxis (http://www.elxis.org), etc. Joomla has
a modular structure, which means that each new feature of the system is added as an
additional module. This allows enterprise users to easily extend the usability of the
system into e-commerce, frequently starting with product catalogues, as well as to
use its CRM and ERP/ERM functionalities.
Drupal [7] is a CMS that allows its users to easily publish, manage and organise
web content. It is equipped with functionalities that include environments for collaborative work on projects, file exchange and much more. An important feature of Drupal
is its system of modules and taxonomy. The latter allows the users to organise web
contents according to predefined categories. Drupal plays an important role in this
paper due to its two exceptional features:
• In recent years, Drupal has led the CMS field in terms of implementing innovations
in its releases,
• Drupal is the leader in offering ERP-related modules [7].
Overall, Drupal is equipped with many special tools that are useful in the business
sector. The platform is scalable and can be used to build enterprise information systems
of any size. Moreover, Drupal strives to simplify its use via 'Drupal Open Enterprise'.

2.2 The Technological Evolution of CMSs
Figure 1 summarises, in the form of a timeline, the history of versions, milestones,
and functionalities of the most popular CMSs covered in the preceding section:
WordPress, Joomla, and Drupal. The data has been gathered in [16] and [18] and
verified based on the recent information provided by the software suppliers.
Fig. 1. A timeline of the development of the three most frequently-used open source CMSs. The
coloured lines mark the releases of the main functionalities of these CMSs on the time scale.
The historical data characterising the relative innovativeness of the selected CMSs
represented in Fig. 1 and used to build the models presented in the next section is based
on [12]. The identification of relevant functionalities and milestones illustrated in the
above timeline, together with the version data, forms the basis on which to build both
technological evolution models.
3 Two Scenarios of Technological Evolution
There are various models that describe the generation of software innovation; however,
none of them is widely accepted. The innovation diffusion models, cf. [10, 12], do not
fully explain the potential for the instantaneous, worldwide appearance of new software. When applied to open-source software, in the absence of price equilibria, the major
role played by essential technological progress must be taken into account. This is partly
included in the famous eight Lehman software evolution laws, cf. e.g. [22], pp. 49–53,
58–61, although they do not sufficiently refer to the emerging oligopolistic structure of
open source software supply. Therefore, in [17, 18] we proposed a new class of
stochastic models, suitable for information systems evolution modelling, including the
open source CMS. The dependent variable is the time of market release of a relevant
innovation by the developers. Explanatory variables are time lags between consecutive
new functionalities released by the same system and between different systems’ releases
with the same functionality. In [17] we compared models based on major version releases
and taking into account selected essential functionalities only.
Following the research reported in [17], this paper extends the scope of modelling
to include models based on analogies, to be applied in cases where no or too few
observations are available to use statistical fitting techniques. This allows us to construct a realistic forecasting model for a situation where the hitherto market leader may
lose its leadership during the forecast period.
In both models presented in the following Subsects. 3.1 and 3.2 we take into
account only the releases of each software system that bring considerable innovation of
relevance to the user community. Specifically, for a given software system S we assume
that the time lags from recent releases of other systems are less important to explain the
evolution of this kind of software than the dates of the introduction of relevant functionalities, which system S does not yet possess. Based on empirical evidence, in the
first model we have additionally assumed that the technological leader (Drupal)
develops its system autonomously, i.e. without taking into account the innovations
introduced by systems that have been less advanced so far. Another development
scenario, which does not admit this assumption and treats all systems as dependent on
the innovations introduced by other developer teams, yields the second model. These
models will generate two CMS and ERP-CMS development and deployment scenarios,
termed Scenario A and Scenario B, respectively. Both models and the corresponding
scenarios are investigated and compared in Sect. 4.
As already mentioned, the coefficients of the second model (Scenario B) and their
goodness of fit could not be determined properly, because a situation involving the
dependence of the leading system on the others did not occur in the past. However,
simulations of such behavior based on analogy to other developer teams are possible
and are crucial in building the second model. The goodness of fit of a simulated Model
2 may be worse than in the case of Model 1, since the estimation of the missing interdependence parameters is based on analogies, not on extrapolation. However, the
study of the second model has an exploratory character, and its lower accuracy is
compensated by a better coverage of possible future states.
In the following sections we will present the assumptions that led to the formulation
of both models, the forecasts that they generate, the impact on the development of CMS
technology and the resulting conclusions.
3.1 Innovation Forecasts in Scenario A
Let x(i), y(i), and z(i) denote the time intervals preceding the next (i-th) essential system
improvement of WordPress, Joomla!, and Drupal, respectively. We can then examine
the relationship between the time of the next new functionality introduction by a
particular development team and the frequency of similar innovations in the past,
created by the developer teams of all above CMSs.
We assume that the time interval of the subsequent system improvement depends
linearly on the n−1 previous time intervals between the system's new functionalities, as well
as on the frequency of the functionalities emerging in other systems. Motivated by a
timeline chart analysis (Fig. 1), which showed Drupal leading in the implementation of
all previous innovations, the first scenario will additionally assume that the Drupal
development team operates autonomously, without taking into account subsequent
versions of WordPress and Joomla. The innovations introduced by the latter two are
already present in Drupal, so its innovations are not influenced by technological
competitive pressure on the Drupal development team. Such pressure may, however,
result from introducing marketing or organisational innovations, which need not be
linked with the successive versions of software. These assumptions lead to the formulation of the following model of innovation creation in the three analysed systems
(1), which is simplified slightly compared to that proposed in [17]:
x(k+1) = a1,1 x(k) + a1,2 x(k−1) + … + a1,n x(k−n+1)
         + b1,2 v1,2(k) + b1,3 v1,3(k) + c1                  (1a)
y(k+1) = a2,1 y(k) + a2,2 y(k−1) + … + a2,n y(k−n+1)
         + b2,1 v2,1(k) + b2,3 v2,3(k) + c2                  (1b)
z(k+1) = a3,1 z(k) + a3,2 z(k−1) + … + a3,m z(k−m+1) + c3    (1c)
where:
– v1,2(k) is the average frequency of introducing a new functionality of Joomla, calculated
on the basis of the P1,2(k) time intervals between essential releases of this system
directly preceding the k-th essential functionality of WordPress;
– v2,1(k) is the average frequency of introducing a new functionality of WordPress,
calculated on the basis of the P2,1(k) time intervals between essential releases of this
system directly preceding the k-th essential functionality of Joomla;
– vj,3(k), for j = 1, 2, is the average frequency of introducing a new version of Drupal,
calculated on the basis of the Pj,3(k) time intervals between essential releases of this
system directly preceding the k-th essential functionality of WordPress (for j = 1)
or Joomla (for j = 2).
Based on numerical experiments, in the above model we admit a further
simplifying assumption, namely that Pi,j(k) takes the maximum value Pi,j(k) = 3,
for i = 1, 2, j = 1, 2, 3, i ≠ j, and that n = 1, m = 2, which nevertheless allows us to obtain
sufficient statistical significance of the model. For the sake of brevity, we omit the
disturbance term in Eq. (1) and in all further models. Furthermore, we take into account
only positive time lags between the launches of essential functionalities, i.e. we assume
that only the technological arrears play a motivating role for the developer teams.
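The construction of the explanatory variables can be made concrete with a short sketch. The function below computes vi,j(k) as the mean of at most Pi,j(k) = 3 inter-release intervals of system j that directly precede the k-th essential release of system i; only releases strictly earlier than that date enter the average, which implements the positive-lag rule. The function and variable names, as well as the sample dates, are ours, not the paper's:

```python
def avg_preceding_frequency(release_dates_j, t_k, p_max=3):
    """v_{i,j}(k): mean of the (at most) p_max inter-release intervals of
    system j that directly precede t_k, the date of the k-th essential
    release of system i.  Only releases strictly before t_k are used,
    so only positive time lags enter the average."""
    earlier = sorted(d for d in release_dates_j if d < t_k)
    if len(earlier) < 2:
        return 0.0  # not enough history to form a single interval
    gaps = [b - a for a, b in zip(earlier, earlier[1:])]
    recent = gaps[-p_max:]  # the p_max intervals closest to t_k
    return sum(recent) / len(recent)

# Example with release dates given in months since an arbitrary origin:
dates_j = [0, 7, 12, 20, 26]
v = avg_preceding_frequency(dates_j, t_k=23)  # averages the gaps 7, 5, 8
```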
Let aj,1, aj,2, …, aj,n denote the slope coefficients of the regression equation for the
dependence of the time interval preceding the j-th system's new functionality on the
previous frequencies of innovation introduced in the same system, i.e. WordPress,
Joomla, and Drupal, respectively, for j = 1, 2, 3. The bi,j, for i = 1, 2, j = 1, 2, 3, and
i ≠ j, denote the coefficients of the linear regression functions that describe the following dependencies:
– for i = 1, j = 2, 3, bi,j is the coefficient of the multivariate linear regression that
explains x(k) with v1,j(k);
– for i = 2, j = 1, bi,j is the coefficient of the multivariate linear regression that explains
y(k) with v2,1(k), and for i = 2, j = 3, bi,j explains y(k) with v2,3(k).
After finding the coefficients of (1) with the least squares method (LSM), we
get the regression function relating the expected time of a new innovation release by
each system – as the dependent variable – to the average time lags between the introduction of technological innovations in all systems. The lags are calculated with respect to the
releases directly preceding the latest innovation in the i-th system, prior to the k-th
improvement of the system described by Eq. (1a) or (1b). The trend drift coefficients ci, i = 1, 2, 3, should vanish after model (1), for autoregression purposes, has been
integrated sufficiently many times to yield a stationary time series for each system.
However, they re-appear when calculating the forecasts for the original time series.
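The least-squares step for a single equation of (1), in the n = 1 case actually used here, amounts to an ordinary multivariate regression of the interval series on its own lag and on the averaged competitor frequencies. A minimal sketch with NumPy follows; the data values are purely illustrative, not the study's observations:

```python
import numpy as np

def fit_lsm(x, v_a, v_b):
    """Fit one equation of model (1) in the n = 1 case,
       x(k+1) = a*x(k) + b1*v_a(k) + b2*v_b(k) + c,
    by ordinary least squares.  Returns (a, b1, b2, c)."""
    x, v_a, v_b = (np.asarray(s, dtype=float) for s in (x, v_a, v_b))
    # Design matrix: lagged intervals, competitor frequencies, intercept
    X = np.column_stack([x[:-1], v_a[:-1], v_b[:-1], np.ones(len(x) - 1)])
    y = x[1:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Illustrative interval data (months), not the study's observations:
x  = np.array([9., 7., 8., 6., 7., 5.])
va = np.array([4., 5., 4., 6., 5., 4.])
vb = np.array([3., 3., 4., 4., 3., 3.])
a, b1, b2, c = fit_lsm(x, va, vb)
```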
Eqs. (1a) and (1b) take into account the pressure of competition. According to the
initial assumption, Eq. (1c) describes the time lags between subsequent Drupal improvements only.
It turned out to be impossible to find a statistically significant model in which the variables
were simple time lags between innovations. However, a stationary time series and, at
the same time, significant regression functions could be found when averaging the
variables three times. From the observation [17] that, by the definition of the variables as
time lags, they possess property (2),
x(t) − x(t−n) = [x(t) − x(t−1)] + [x(t−1) − x(t−2)] + … + [x(t−n+1) − x(t−n)],   (2)
it follows that the above averaging operation is equivalent to the integration of the
original time series.
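Identity (2) is a plain telescoping sum, which is why averaging consecutive lags acts like integrating the series; a quick numeric check (the series values are arbitrary):

```python
def telescoping_check(x, t, n):
    """Left- and right-hand sides of identity (2):
    x(t) - x(t-n) versus the sum of n consecutive one-step lags."""
    lhs = x[t] - x[t - n]
    rhs = sum(x[t - i] - x[t - i - 1] for i in range(n))
    return lhs, rhs

# Any series of 'time lag' values will do for the check:
x = [3, 5, 4, 9, 11, 10]
lhs, rhs = telescoping_check(x, t=5, n=4)  # both equal x[5] - x[1]
```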
The statistical significance of both models was investigated with the F test (Fisher–Snedecor, cf. e.g. [19]) and goodness-of-fit tests with the determination coefficient R².
Tests confirmed the significance of the calculation results for triple averaged data,
which corresponds to the case where all variables characterising the frequency of
innovations are calculated as an average of three time lags between essential new
functionalities in all systems modeled.
Finally, the forecasting model in Scenario A is provided in Eq. (3) below.
x(k+1) = 0.1521 x(k) + 0.6284 v1,2(k) + 2.9762 v1,3(k) + 8.2919     (3a)
y(k+1) = 0.3679 y(k) + 2.9762 v2,1(k) − 2.0860 v2,3(k) − 18.3726    (3b)
z(k+1) = 1.1662 z(k) − 0.6892 z(k−1) + 4.6565                       (3c)
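Once the coefficients are fitted, the Scenario A forecasts follow by iterating the recursions in (3). The sketch below iterates the autonomous Drupal equation (the third equation of (3)); the starting intervals are illustrative, and release dates are recovered as the running sum of the forecasted intervals:

```python
def forecast_drupal(z_prev, z_curr, steps):
    """Iterate the autonomous Drupal recursion of Eq. (3):
       z(k+1) = 1.1662*z(k) - 0.6892*z(k-1) + 4.6565,
    starting from the two most recent inter-release intervals (months).
    Returns the list of forecasted intervals."""
    out = []
    for _ in range(steps):
        z_next = 1.1662 * z_curr - 0.6892 * z_prev + 4.6565
        out.append(z_next)
        z_prev, z_curr = z_curr, z_next
    return out

# Illustrative starting intervals of 8 and 7 months:
intervals = forecast_drupal(8.0, 7.0, steps=5)
# Forecasted release dates = cumulative sum of the intervals
release_months = [sum(intervals[:i + 1]) for i in range(len(intervals))]
```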
As the significance of the regression function does not guarantee that the coefficients are significant, we calculated the confidence intervals for each of them. Then, to
complete the usual statistical analysis of forecasted values, we calculated the confidence
intervals for the number of innovations. The forecasting procedure was pursued until
the last release of the most slowly evolving system reached the year 2025, i.e. the
foresight horizon of the project SCETIST [16].
The forecasting results for the horizons 2025 and 2030 are provided in Table 1,
while the forecasts of the number of implemented innovations are visualised in Fig. 2.
The chart shows that Drupal, which was assumed to develop autonomously, will
temporarily lose its leadership twice. Joomla will catch up with Drupal within a few
years and the development of both systems will follow the same trajectory until 2025.
Both systems will overtake WordPress. In about ten years from now, all systems will
converge, i.e. the number of implemented innovations in WordPress, Joomla and
Drupal will be practically equal. This may ultimately result in mergers or in the
elimination of some systems from the market, as the community of developers will no
longer differentiate between them. The simulation indicates that this situation will happen
around 2030. However, the confidence intervals of the forecasted quantities will
already be rather large in 2025 and grow further beyond this time horizon (cf. Table 1)
while the longer-term forecasts may be disturbed by new factors and phenomena that
have not appeared so far and were not included in model (1).
Table 1. The forecasts of implemented innovations in system releases and their confidence
intervals (p = 0.95) for Scenario A
Fig. 2. Forecasted number of implemented innovations by each of the systems in Scenario A.
3.2 Innovation Forecasts in Scenario B
Let us recall that Scenario B differs from Scenario A in the assumption that all
systems, including the current innovation leader, take into account the new functionalities introduced in the software releases of their competitors. Their decisions concerning the timing of new version releases are modified quasi-linearly according to the
time elapsed since the new functionalities were launched by other systems. To build a
CMS forecasting model with this assumption, we will use the same essential functional
improvements that were identified for Model 1 and Scenario A, cf. Eq. (1),
Fig. 2. The interdependence coefficients between competing systems are calculated
based on the time lags (in months) between subsequent (i-th) essential system
improvements. The assumptions concerning the variables are the same as in Model 1,
in particular only positive lags between the introduction of new functionalities are
taken into account. This led us to the formulation of an extended model of essential
innovation generation in WordPress, Joomla, and Drupal, described by Eq. (4).
x(k+1) = a1,1 x(k) + a1,2 x(k−1) + … + a1,n x(k−n+1)
         + b1,2 v1,2(k) + b1,3 v1,3(k) + c1                  (4a)
y(k+1) = a2,1 y(k) + a2,2 y(k−1) + … + a2,n y(k−n+1)
         + b2,1 v2,1(k) + b2,3 v2,3(k) + c2                  (4b)
z(k+1) = a3,1 z(k) + a3,2 z(k−1) + … + a3,n z(k−n+1)
         + b3,1 v3,1(k) + b3,2 v3,2(k) + c3                  (4c)
The notation is similar to that used in Scenario A and Model (1). The new variables of Model 2, v3,j(k), for j = 1, 2, denote the average frequency of introducing
a new version of WordPress (j = 1) or Joomla (j = 2), calculated on the basis of the P3,j(k)
time intervals between essential releases of the corresponding system directly preceding the k-th essential functionality of Drupal. The other difference consists in the
fact that the coefficients b3,1 and b3,2 could not be estimated with multivariate
regression because during the observation period (2004–2012) Drupal was always a
leader in launching new functionalities, so the variables v3,1(k) and v3,2(k) were identically equal to 0. To simulate the reaction of the Drupal team to a potentially different
situation that may occur in the future, b3,1 and b3,2 were assumed equal to 1.5 and 0.7,
respectively. These values resulted from averaging the reactions of the WordPress
and Joomla teams to Drupal's past advantages, taking into account the mean values
of the variables x, y, and z. The subjectivity of this assumption illustrates the difference
between forecasting and foresight methodology: foresight approaches make it
possible to explore the future in situations where no extrapolation of the past is possible,
based on heuristic observations and assumptions. Analogy-based reasoning is the only
way to investigate phenomena that have never occurred before.
The triple averaging of the input data again yielded the best outcomes. The coefficients
of the second model are provided in Eq. (5) below.
x(k+1) = 0.1521 x(k) + 0.6284 v1,2(k) + 2.9762 v1,3(k) + 8.2919               (5a)
y(k+1) = 0.3679 y(k) + 2.9762 v2,1(k) − 2.0860 v2,3(k) − 18.3726              (5b)
z(k+1) = 1.1662 z(k) − 0.6892 z(k−1) + 1.5 v3,1(k) + 0.7 v3,2(k) + 4.6565     (5c)
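The only structural difference from Scenario A lies in Eq. (5c): the analogy-based terms 1.5·v3,1(k) and 0.7·v3,2(k) are zero as long as Drupal leads (only positive lags count) and switch on once a competitor introduces a functionality Drupal lacks. A sketch of this switching behaviour, with illustrative interval and frequency values of our own choosing:

```python
def drupal_interval_b(z_prev, z_curr, v31=0.0, v32=0.0):
    """One step of Eq. (5c):
       z(k+1) = 1.1662*z(k) - 0.6892*z(k-1)
                + 1.5*v31 + 0.7*v32 + 4.6565.
    v31 and v32 remain 0 while Drupal leads (only positive lags are
    counted); they take positive values once WordPress or Joomla
    introduces a functionality Drupal does not yet have."""
    return (1.1662 * z_curr - 0.6892 * z_prev
            + 1.5 * v31 + 0.7 * v32 + 4.6565)

# While Drupal leads, (5c) reduces to the autonomous equation of (3):
lead = drupal_interval_b(8.0, 7.0)
# After losing the lead, the competitor-frequency terms switch on
# (v31 = 5, v32 = 6 months are illustrative values):
chase = drupal_interval_b(8.0, 7.0, v31=5.0, v32=6.0)
```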
The significance of Model (5) is the same as that of Model (1) during the observation
period, and it is undefined during the forecasting period. The occurrence of any event
that causes non-zero values of v3,1(k) and v3,2(k) would make it possible to use
supervised learning techniques to update the coefficients b3,1 and b3,2. The results of
the CMS innovation implementation forecasts in Scenario B are shown in Fig. 3.
The ability to observe the other systems and react gives the Drupal team a development impulse every time it is surpassed by Joomla. In this scenario WordPress initially stays behind, but by the end of the forecasting period the functionalities of all systems differ only slightly. As in Scenario A, this again provides clues regarding potential mergers or transformations of the main open source CMSs during the next decade. The evolution of the Internet, the emergence of new business models, and new information technologies may additionally contribute to an end of the CMS era.
The confidence intervals for the quantities of implemented innovations for the two forecasting horizons, 2025 and 2030, are presented in Table 2 below. The assumed coefficients b3,1 and b3,2 are regarded as deterministic and do not influence the stochastic properties of Model (5).
A.M.J. Skulimowski and I. Badecka
Fig. 3. Forecasted number of innovations for the three CMSs in Model 2 – Scenario B (axes: no. of innovations vs. time, yy-mm)
Table 2. The forecasts of implemented innovations in system releases and their confidence intervals (p = 0.95) for Scenario B
4 A Comparison of Scenarios A and B
A comparison of both CMS evolution scenarios is shown in Table 3 below. It turns out
that there is no considerable difference in CMS innovative behavior between the
scenarios, with somewhat more activity in Scenario B. All systems generate more innovations in Scenario B.
Table 3. Expected number of essential innovations in the CMS releases until 2030 – a scenario comparison
Software Innovation Dynamics in CMSs and Its Impact
The ability to consider technological and market signals
from the other two systems allowed Drupal to generate slightly more (about 5% in
new essential functionalities in its releases. Without this ability, however, Drupal would still preserve its leadership until 2030. The more active Drupal in Scenario B boosts the innovativeness of Joomla: its number of essential innovations rises by 5 (or by 12%) until 2030. The better performance of Joomla is accompanied by a slightly lower rise in the innovativeness of WordPress between 2025 and 2030.
The convergence trend is more salient in Scenario B, where the number of functionalities of all systems in 2025 and 2030 is almost the same.
Another comparison is provided in Table 4, which shows the value of the indicator "CMS innovativeness growth index" defined in [17]. This indicator (5th row in Table 4) is calculated as the ratio of the forecasted yearly average number of innovations during the period 2012–2025 (4th row in Table 4) to the actual average value observed during the period 2005–2012 (3rd row).
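The index computation itself is a simple ratio of yearly averages; the sketch below uses made-up counts, not the values from Table 4:

```python
# "CMS innovativeness growth index" (5th row of Table 4): the ratio of the
# forecasted yearly average number of innovations (2012-2025) to the
# observed yearly average (2005-2012). The counts used below are
# illustrative placeholders, not values from the paper.

def growth_index(forecast_count, forecast_years, observed_count, observed_years):
    return (forecast_count / forecast_years) / (observed_count / observed_years)

# e.g. 26 forecasted innovations over 13 years vs. 14 observed over 7 years
ratio = growth_index(26, 13, 14, 7)
```

A value above 1 indicates that the forecasted innovation pace exceeds the historically observed one.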
Table 4. Number of innovations in CMS releases until 2025 – Scenarios A and B compared (rows: no. of innovations, mean yearly innovations no., and average innovation growth index; columns: Scenario A and Scenario B for each system)
The above results show that the assumption concerning market leader behavior that differentiates the two scenarios may have a remarkable influence on the innovativeness of all the developer teams concerned. It turns out that a 'weaker' leader, which follows its own development strategy and does not react to its competitors' achievements, suppresses the innovative activity of all market players. Conversely, an active leader may increase the market strength of all open source CMS suppliers and reduce the differences between their products and the innovativeness of their developer teams.
Future Features and Functionalities of CMSs
An important issue that has not been explained yet is the transition between quantitative
innovation characteristics and a concrete sequence of innovations that will be created
by each of the CMS development teams. A partial answer can be derived from
development plans that, however, are announced for a relatively short period of one
or two years only [7, 11, 21]. Other clues are given by the version and functionality
analysis performed in Sect. 2.2 and by both innovation development models presented
in Sects. 3.1, 3.2 and 4. It follows that Drupal plays the role of leading innovator, while
the other systems follow its functionalities with some delay. Taking into account that
Drupal has already entered the ERP/ERM market and plans further expansion in the enterprise software sector, it can be concluded that the other systems will follow suit
during the next decade.
In addition, all the above-analysed CMS developers pay much attention to research,
which allows them to quickly implement new technologies. Below is a (non-exhaustive) list of ideas, functionalities, and new modules that are either in the process
of being tested or are planned for implementation during the next few years (cf. the
CMS web sites [7, 11, 21]). Some of them will support ERP functionalities or meet
general enterprise needs. A few new ERP-related features, namely innovation support
modules as well as HR management systems, have been pointed out as relevant by the
respondents of the Delphi survey performed within the project SCETIST [16]. They are
also included in the list below.
• Cognitive user models allowing the system to learn how users interact with the website
and the applications installed there. These models will be further developed towards
intelligent recommendation systems.
• General enterprise managerial decision support modules using integrated
CMS-supported databases containing data provided by other subsystems [1].
• HR management.
• Quality management modules – different versions according to the sector needs and
ISO, EMAS, or other quality norms implemented.
• Template builder – easy creation of templates from the administration panel.
• In-company innovation support systems.
The approach applied in this paper yields models that put emphasis on the overall
number of functionalities implemented in software systems. They do not yet allow us to
determine the sequence of implementation of specific features. However, the latter may
be estimated based on development roadmaps published on CMS websites, on research
of market expectations and scenarios of software use, and on general software evolution trends. A more detailed study of technological trajectories of ERP-CMS and
enterprise decision support systems will be a subject of further research.
5 Conclusions
In this study, we described the foundations of an important software development
trend, namely enterprise application integration (EAI) based on the gradual expansion
of the scope of applications covered by a company’s CMS functionalities. We conducted an overview and comparison of the most popular open source CMSs used
primarily, but not exclusively, by SMEs. We maintain that this trend will bring new
technological opportunities to SMEs in particular. Using open-source software modules, SMEs will be able to cover more areas of their commercial activity with modern
ICT solutions. This trend will interfere with the “going mobile” and “moving to the
cloud” trends that are influencing all enterprises, but are of particular importance to
small and micro companies. Specifically, web-based ERP/ERM applications are in a
better position to include mobile technologies than offline systems. This is the case for
applications built with CMS technology, both open-source and commercial.
While CMS-based systems will expand into the cloud and mobile worlds,
according to the Delphi survey performed in [16], the accounting software used in
SMEs in the perspective of 2025 will still resist full integration in cloud-based systems.
This is partly due to the traditional vigilance and concern that entrepreneurs have as
regards revealing their financial data in the public space, despite the fact that ICT
security has been steadily improving. This contradicts the widely-held belief that a vast
majority of enterprises will gradually implement professional ERP software built
around accounting, inventory and sales modules. Thus, accounting and personal record
files may remain the core of traditional ERP systems. However, this could vary from
country to country [14] and may depend on the economic sector.
Finally, let us observe that the methodology presented in Sects. 2, 3 and 4 can be
used to investigate the technology race between companies that regularly release new
versions of non-CMS software products. A necessary prerequisite is that the market for
the product under study is oligopolistic, with a reasonably small number of suppliers
(say, fewer than 20), which enables building and constructively analyzing a model of type
(1). Some hints regarding sample-based generalization strategies of software-related
models are provided in [20]. While many open source applications and programming
environments fulfill this assumption well, the approach presented in this paper is not
restricted to software and may be used to forecast the development of electronic
components, cars, and other products and technologies.
Acknowledgement. This research was supported by the research project “Scenarios and
Development Trends of Selected Information Society Technologies until 2025” (SCETIST)
co-financed by the ERDF, Contract No. WND-POIG.01.01.01-00-021/09.
1. Asprey, L., Middleton, M.R.: Integrated document management for decision support. In:
Burstein, F., Holsapple, C. (eds.) Handbook on DSS, vol. 1, pp. 191–206. Springer,
Heidelberg (2008)
2. Builtwith website (web technology lookup). trends.builtwith.com/cms. Accessed July 2016
3. Castellina, N.: SaaS and Cloud ERP Trends, Observations, and Performances. Aberdeen
Group, Boston (2011)
4. Cataldo, A., McQueen, R.J., Hardings, J.: Comparing strategic IT alignment versus process
IT Alignment in SMEs. J. Res. Pract. Inf. Technol. 44(1), 43–57 (2012)
5. Chia-Chen, Y., Chiaming, Y., Jih-Shih, H.: A web-based CMS/PDM integration for product
design and manufacturing. In: IEEE International Conference on e-Business Engineering,
pp. 549–553 (2008)
6. Devos, J., Landeghem, H., Deschoolmeester, D.: Using bricolage to facilitate emergent
collectives in SMEs. In: Proceedings of the 6th European Conference on Information
Management and Evaluation, pp. 82–90 (2012)
7. Drupal web site. Drupal Foundation, www.drupal.org. Accessed July 2016
8. Hailu, A., Rahman, S.: Evaluation of key success factors influencing ERP implementation
success. In: Proceedings of the 2012 IEEE 8th World Congress on Services, pp. 89–91
9. Hallikainen, P., Kivijarvi, H., Nurmimaki, K.: Evaluating strategic IT investments: an
assessment of investment alternatives for a web content management system. In:
Proceedings of the 35th HICSS, pp. 2977–2986 (2002)
10. Jiang, Z., Jain, D.C.: A generalized Norton-Bass model for multigeneration diffusion.
Manag. Sci. 58(10), 1887–1897 (2012)
11. Joomla! web site. www.joomla.org, Open Source Matters, Inc. Accessed July 2016
12. Kapur, P.K., Sachdeva, N., Singh, O.: Generalized discrete time model for multi
generational technological products. In: Proceedings of the International Conference on
Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE),
pp. 717–723 (2015)
13. Lee, J., Siau, K., Hong, S.: Enterprise integration with ERP and EAI. Commun. ACM 46(2),
54–60 (2003)
14. Leina, Z., Tiejun, P., Guoyan, R., Chengbin, F.: Development and implementation of
ERP/CRM system based on open source software to small and medium-sized enterprise in
China. In: ICICTA, vol. 2, pp. 725–730 (2008)
15. Liu, X., Zhang, W.J., Prasad, R., Tu, Y.L.: Manufacturing perspective of enterprise
application integration: the state of the art review. Int. J. Prod. Res. 46(16), 4567–4596
16. Skulimowski, A.M.J. (ed.): Trends and Development Scenarios of Selected Information
Society Technologies (in Polish). Progress and Business Publishers, Kraków (2015)
17. Skulimowski, A.M.J., Badecka, I.: Competition boosts innovativeness: a quantitative
software evolution model. In: Theeramunkong, T., et al. (eds.) Proceedings of the Tenth
International Conference on Knowledge, Information and Creativity Support Systems
(KICSS 2015), 12–14 November 2015, Phuket, Thailand, pp. 433–446 (2015)
18. Skulimowski, A.M.J., Badecka, I., Golonka, D.: New trends in the technological
development of CMS-based enterprise software for SMEs. In: Skulimowski, A.M.J. (ed.)
Trends and Development Scenarios of Selected Information Society Technologies. Progress
and Business Publishers, Kraków (2015)
19. Snedecor, G.W., Cochran, W.G.: Statistical methods. J. Educ. Behav. Stat. 19(3), 304–307
20. Wieringa, R.J., Daneva, M.: Six strategies for generalizing software engineering theories.
Sci. Comput. Program. 101, 136–152 (2015)
21. WordPress web site, www.wordpress.org. Accessed July 2016
22. Tripathy, P., Naik, K.: Software Evolution and Maintenance: A Practitioner’s Approach,
p. 393. Wiley, Hoboken (2015)
23. Zykov, V.S.: Integrating enterprise software applications with web portal technology. In:
Proceedings of the 5th International Workshop on Computer Science and Information
Technologies CSIT 2003, vol. 1, Ufa, Russia, pp. 60–65. Ufa State Aviation Technical
University Editorial-Publishing Office (2003). arXiv:cs/0607127
Optimization of Cloud-Based Applications
Using Multi-site QoS Information
Hong Thai Tran1 and George Feuerlicht1,2,3
Faculty of Engineering and Information Technology,
University of Technology, Sydney, Ultimo, Australia
Unicorn College, V Kapslovně 2767/2, 130 00 Prague 3, Czech Republic
Department of Information Technology, University of Economics,
Prague, W. Churchill Sq. 4, Prague 3, Czech Republic
Abstract. With rapid increase of the use of cloud services, the availability of
Quality of Service (QoS) information is becoming of utmost importance to assist
application managers in selection of suitable services for their enterprise
applications. Due to different characteristics of cloud and on-premise services,
monitoring and management of cloud-based enterprise applications requires a
different approach that involves the monitoring of QoS parameters such as
availability and response time in different geographic locations. In this paper, we
propose a multi-site model for the monitoring and optimization of cloud-based
enterprise applications that evaluates the availability and response time of cloud
services concurrently across different geographic locations. Our preliminary
results using eWay and PayPal payment services monitored in eleven sites
across four geographic regions indicate that location-based information can be
used to improve the reliability and performance of cloud-based enterprise applications.
1 Introduction
SOA (Service Oriented Architecture) is evolving towards a more flexible, dynamically
scalable cloud-based computing architecture for enterprise applications. Typically,
multiple cloud and on-premise services are composed using different protocols and
integration methods to provide the required enterprise application functionality. As
cloud services are sourced from different cloud providers their QoS (Quality of Service)
characteristics can substantially differ depending on the geographical location and on
the provider cloud infrastructure. While most cloud service providers publish QoS
information on their websites, it often does not accurately reflect the values measured at
the consumer site as the performance of cloud services is impacted by numerous factors
that include dynamic changes in network bandwidth and topology and transmission
channel interference [1]. Additionally, changes in provider internal architecture and
method of service delivery can significantly impact on QoS characteristics of cloud
services. Consequently, consumer monitoring and optimization of the runtime
behaviour of cloud services has become critically important for the management of
enterprise applications [2].
© IFIP International Federation for Information Processing 2016
Published by Springer International Publishing AG 2016. All Rights Reserved
A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 325–338, 2016.
DOI: 10.1007/978-3-319-49944-4_24
H.T. Tran and G. Feuerlicht
Service monitoring is a run-time activity that involves recording the values of
response time, availability and other non-functional service parameters in order to
enable predictive analysis and proactive service management. Service monitoring and
service management in cloud computing environments presents a particular challenge
to application administrators as the enterprise application is dependent on the performance and availability of third-party cloud services. The traditional approach to QoS
monitoring is based on continuously sending test messages to critical services to check
their availability and performance. This approach is not suitable for the monitoring of
cloud services as it increases service costs and generates unnecessary data traffic.
Monitoring and optimization of QoS of cloud services presents an important and
challenging research problem. Although some research work on monitoring of QoS
characteristics of cloud services is available in the literature, there is currently a lack of
detailed information about the assessment of run-time behaviour of cloud services that
includes location-based QoS information [3], making informed decisions about the
selection and composition of cloud services difficult in practice [4, 5].
In our earlier work we have described the features of the Service Consumer
Framework (SCF) designed to improve the reliability of cloud-based enterprise
applications by managing service outages and service evolution. We have implemented
and experimentally evaluated availability and response time characteristics of payment
services (PayPal and eWay) using three separate reliability strategies (Retry Fault
Tolerance, Recovery Block Fault Tolerance, and Dynamic Sequential Fault Tolerance)
and compared these experimental results with theoretically predicted values [6].
In this paper we extend this work by focusing on improving the estimates of
availability and response time of cloud services by introducing location-based QoS
information. We monitor QoS characteristics of eWay and PayPal services across
eleven locations in four geographical regions to obtain a more accurate estimate of
response time and availability for specific deployment locations of the consumer enterprise
application. We collect the QoS information independently of the information published by cloud service providers by recording payment transaction log data in a
monitoring database. In the next section (Sect. 2) we review related literature dealing
with monitoring the performance of cloud-based services, and in Sect. 3 we discuss
service optimization using multi-site monitoring. Section 4 describes our experimental
setup for multi-site monitoring of cloud services and gives experimental results of
availability and response time for eWay and PayPal payment services measured at
eleven geographic locations. Section 5 contains our conclusions and proposals for
future work.
2 Related Work
Optimization techniques to improve reliability and performance of enterprise applications that include fault prevention and forecasting have been the subject of research
interest for a number of years [7]. Such techniques have been recently adapted for web
services and cloud-based enterprise applications. Using redundancy-based fault tolerance strategies, Zibin and Lyu [8] propose a distributed replication strategy evaluation
and selection framework for fault tolerant web services. Authors compare various
replication strategies and propose a replication strategy selection algorithm. Adams
et al. [9] describe fundamental reliability concepts and a reliability design-time process
for organizations, providing guidelines for IT architects to mitigate potential failures of
cloud-based applications.
Developing reliable cloud-based applications involves a number of new challenges,
as enterprise applications are no longer under the full control of local developers and
administrators. In response to such challenges, Zibin et al. [10] present a FTCloud
component ranking framework for fault-tolerant cloud applications. Using structure-based component ranking and hybrid component ranking algorithms, the authors identify
the most critical components of cloud applications and then determine an optimal
fault-tolerance strategy for these components. Based on this work, Reddy and Nalini
[11] propose the FT2R2Cloud framework as a fault tolerant solution using time-out and
retransmission of requests for cloud applications. FT2R2Cloud measures the reliability
of software components in terms of the number of responses and throughput. Authors
propose an algorithm to rank software components based on reliability as calculated
using number of service outages and service invocations over a period of time.
Other authors focus on QoS optimization; for example, Deng and Xing [12] propose a QoS-oriented optimization model for service selection. This approach involves
developing a lightweight QoS model, which defines functionality, performance, cost,
and trust as QoS parameters of a service. Authors have verified the validity of the
model by simulation of cases that show the effectiveness of service selection based on
these QoS parameters. Leitner et al. [13] formalize the problem of finding an optimal
set of adaptations, which minimizes the total cost arising from Service Level Agreement (SLA) violations and the cost of preventing the violations. Authors present
possible algorithms to solve this complex optimization problem, and describe an
end-to-end approach based on the PREvent (Prediction and Prevention based on Event
monitoring) framework. They discuss experimental results that show how the application of their approach leads to reduced service provider costs and explain the circumstances in which different algorithms lead to satisfactory results. Other authors
have focused on predicting future QoS values using service performance history
records. Wenmin et al. [1] present a history record-based service optimization method,
called HireSome that aims at enhancing the reliability of service composition plans.
The method takes advantage of service QoS history records collected by the consumer,
avoiding the use of QoS values recorded by the service provider. Authors use a case
study of a multimedia delivery application to validate their method. Lee et al. [14]
present a QoS management framework that is used to quantitatively measure QoS and
to analytically plan and allocate resources. In this model, end-user quality preferences
are considered when system resources are apportioned across multiple applications,
ensuring that the net end-user benefit is maximized. Using semantically based techniques to automatically optimize service delivery, Fallon and O’Sullivan [15] introduce
the Semantic Service Analysis and Optimization (AESOP) approach and a Service
Experience and Context COllection (SECCO) framework. The AESOP knowledge
base models the end-user service management domain in a manner that is aware of the
temporal properties of the services. The autonomic AESOP Engine runs efficient
semantic algorithms that implement the Monitor, Analyze, Plan, and Execute (MAPE)
functions using temporal properties to operate on small partitioned subsets of the knowledge base. A case study is used to demonstrate that AESOP is also applicable in the Mobile Broadband Access domain.
Fig. 1. Online shopping check out optimization scenario
So far, only very limited attention has been paid to using location-based QoS
information for the optimization of cloud-based enterprise applications.
3 Service Optimization Using Multi-site Monitoring
Service optimization is concerned with continuous service improvement and aims to
optimize performance and cost of business services. Consider, for example, the situation illustrated in Fig. 1 that shows an Online Shopping Check Out service that
includes a cloud-based payment gateway. At design time, the service consumer needs
to select a suitable payment service to integrate into the business workflow ensuring
that both the functional and non-functional requirements are satisfied. Making this
selection decision requires the knowledge of QoS parameters at the site where the
enterprise application is deployed.
Typically, both the service provider and service consumer perform service monitoring independently, and both parties are responsible for resolving service quality
issues that may arise. Service providers maintain transactions logs and make these logs
available to service consumers who can use this information to calculate service costs
and to estimate service QoS. Provider QoS data is collected continuously at the provider site irrespective of any connectivity issues and includes information about
planned and unplanned outages. However, the QoS values published by service providers may not accurately reflect the values measured at the service consumer site as
QoS depends on the deployment location of the enterprise application and is affected
by the quality of the network connection, provider location, and service configuration.
With some global cloud service providers, the actual location from which the service is
delivered may not be known to service consumers, making it difficult to optimize the performance of the enterprise application based on QoS values published by the provider.
Fig. 2. Multi-site cloud service monitoring
The QoS values measured at the consumer site are impacted by connectivity
issues, and while these values may not fully reflect provider site QoS measurements
they are important indicators of enterprise application performance. Multi-site monitoring can be used to overcome the limitations of single-site (provider or consumer)
QoS monitoring by mapping the behaviour of cloud services across different sites and
geographical regions. We argue that in order to fully optimize cloud service selection
and deployment and to ensure that the non-functional requirements are met at run-time,
the service consumer needs to know the runtime QoS values of cloud services as
measured in different geographic locations. To accomplish this, we propose a model
that uses a centralized monitoring database to collect service QoS data from multiple
service consumer locations and makes this data available for analysis by service
consumers (Fig. 2). This can be achieved by collaboration among different service
consumers who record their local monitoring data in a global QoS database and share
this information with other consumers of cloud services. The implementation of such a
shared QoS monitoring database would enable accurate real-time QoS analysis and
real-time notifications of QoS issues. Runtime performance information (i.e. response
time, availability and various types of error messages) recorded in the database can be
used by application administrators to monitor service utilization, plan maintenance
activities, and to perform statistical analysis of response time and throughput for
individual cloud services.
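One possible shape of a record in such a shared QoS database, mirroring the fields logged by the payment service adaptor (service name, location, start time, end time, result, error code); the field names and types below are assumptions for illustration, not an actual schema:

```python
# A sketch of one record in the proposed shared QoS monitoring database.
# Field names and types are assumptions, not the SCF's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QoSRecord:
    service_name: str            # e.g. "PayPal"
    consumer_site: str           # deployment location, e.g. "Sydney"
    start_time: float            # transaction start (epoch seconds)
    end_time: float              # transaction end (epoch seconds)
    success: bool                # invocation result
    error_code: Optional[str] = None

    @property
    def response_time(self) -> float:
        # Per-transaction response time derived from the logged timestamps
        return self.end_time - self.start_time
```

Aggregating such records per service and per site yields the availability and response-time statistics discussed in Sect. 4.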
3.1 Enterprise Application Optimization Strategies
Optimization of enterprise applications that use cloud services may involve a number
of different strategies that range from using alternative cloud services to migrating the
servers that run the application to a different cloud infrastructure. With increasing
availability of alternative cloud services with equivalent functionality, service consumers can choose services to use in their enterprise applications based on the cost and
QoS characteristics. This may involve deployment of a new version of an existing
service or replacement of the service with an alternative from a different provider, if the
original service becomes obsolete or too costly. Service consumers can also optimise
application performance by relocating the application to a different cloud infrastructure,
selecting a more suitable geographic location, taking into account both end-user connectivity and connectivity to third-party cloud services. Finally, QoS characteristics of
cloud-based enterprise applications can be improved by using various reliability
strategies, re-configuring cloud services to provide higher levels of fault tolerance [6].
These fault tolerance strategies include Retry Fault Tolerance (RFT), Recovery Block
Fault Tolerance (RBFT) and Dynamic Sequential Fault Tolerance (DFST) strategies.
Using the RFT strategy, cloud services are repeatedly invoked following a delay period
until the service invocation succeeds. RFT helps to improve reliability, in particular
in situations characterized by short-term outages. The RBFT strategy relies on service
substitution using alternative services invoked in a specified sequence. This failover
configuration includes a primary cloud service used as a default (active) service, and
stand-by services that are deployed in the event of the failure of the primary service, or
when the primary service becomes unavailable because of scheduled/unscheduled
maintenance. The DFST strategy is a combination of the RFT and RBFT strategies that
deploys an alternative service when the primary service fails following RFT retries
[16]. The choice of an optimal strategy for the deployment of cloud services must be
based on in-depth knowledge of QoS characteristics including their dependence on the
geographical location.
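A minimal sketch of how these strategies compose: RFT retries wrapped inside RBFT failover yields the DFST behaviour described above. The service callables, retry counts, and delays are illustrative assumptions:

```python
import time

def invoke_dfst(services, retries=3, delay=0.5):
    """DFST sketch: try the primary service with RFT-style retries, then
    fail over (RBFT-style) to each stand-by service in the configured
    sequence. `services` is an ordered list of zero-argument callables."""
    last_error = None
    for service in services:              # RBFT failover order
        for _ in range(retries):          # RFT retry loop
            try:
                return service()          # success: return the result
            except Exception as exc:      # failure: wait, then retry
                last_error = exc
                time.sleep(delay)
    raise RuntimeError("all services failed") from last_error
```

Setting `retries=1` reduces the sketch to plain RBFT, and passing a single service reduces it to plain RFT.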
4 Experimental Setup for Multi-site Monitoring
In order to evaluate the proposed location-based QoS approach to optimization of
cloud-based enterprise applications we have implemented an experimental multi-site
monitoring environment for two payment services: the PayPal Pilot service (pilot-payflowpro.paypal.com) and the eWay Sandbox (https://api.sandbox.ewaypayments.com).
The QoS data was collected using Amazon Elastic Compute Cloud (AWS EC2) servers
deployed at eleven sites (Mumbai, Seoul, Singapore, Sydney, Tokyo, Frankfurt, Ireland, Sao Paulo, California, Oregon and Virginia) across four different geographic
regions (Asia Pacific, Europe, South America and the US). The monitoring database
was implemented using Microsoft SQL Server on the Amazon Relational Database Service (AWS RDS). The QoS data was collected at each site by monitoring payment transactions and
removing private data such as customer information before recording the information in
the monitoring database. Simulating over 200,000 payment transactions initiated by
300 users, payment services were invoked using the SCF (Service Consumer
Framework) payment service adaptor that logs the service name, location, start time,
end time, result, and error code for each payment transaction [17].
The payment service response time for a transaction (TT) was calculated as:
TT = TE − TS (1)
where TE is the end time and TS is the start time of the transaction, and the average response time of a service (Ts) was calculated as:
Ts = (1/n) Σ TT (2)
where n is the number of transactions and TT is the response time of a transaction in Eq. (1). Similarly, the inactive time or downtime of a service (TI) is calculated as:
TI = TAS − TIS (3)
where TIS is the start time of a failed transaction and TAS is the start time of the next successful transaction. Then, the probability of failure (PFS) and the availability of a service (AS) are calculated as:
PFS = Σ TI / D (4)
AS = 1 − PFS (5)
where D is the duration of the test period, calculated using the end time of the last transaction (TLE) and the start time of the first transaction (TFS), TI is the downtime in Eq. (3), and AS is the availability of the service.
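These quantities can be computed directly from a transaction log. The sketch below assumes each record is a chronologically ordered (start, end, success) tuple and that a downtime interval runs from the first failed transaction of an outage to the start of the next successful one; both are simplifying assumptions for illustration:

```python
def qos_metrics(log):
    """Compute average response time (Eq. 2) and availability (Eq. 5)
    from a chronologically ordered list of (start, end, success) tuples."""
    # Eq. (1)/(2): mean response time over all n transactions
    mean_rt = sum(end - start for start, end, _ in log) / len(log)

    # Eq. (3): downtime runs from the start of the failed transaction that
    # opens an outage to the start of the next successful transaction
    downtime, prev_ok = 0.0, True
    for i, (start, _, ok) in enumerate(log):
        if not ok and prev_ok:
            next_success = next((s for s, _, k in log[i + 1:] if k), None)
            if next_success is not None:
                downtime += next_success - start
        prev_ok = ok

    # Eq. (4)/(5): D = TLE - TFS, PFS = sum(TI) / D, AS = 1 - PFS
    duration = log[-1][1] - log[0][0]
    return mean_rt, 1.0 - downtime / duration
```

Running this per monitoring site produces the per-location figures of the kind reported in Table 1.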
Table 1 shows the response time of eWay and Paypal payment services as measured in different geographical locations over the monitored period 20th to 28th August
2016. The table shows that the response time of the eWay service is better (in most
cases less than half) than the response time of the PayPal service, while the availability
of both services is approximately the same. Both response time and availability are
influenced by two major factors: provider QoS characteristics and the reliability of the
network connection. In order to optimize the consumer side QoS characteristics it is
important to identify which of these factors plays the dominant role. If network connectivity is the dominant factor impacting service quality, then using the RFT
fault tolerance strategy described in Sect. 3.1 above may improve consumer side QoS,
but only for situations characterized by short-term outages or latency fluctuations.
When network connectivity suffers from long-term outages, the solution may involve
migrating the service to a different cloud infrastructure in a different geographical
location. However, if network connectivity is not a dominant factor and QoS degradation is caused by provider related issues, then RBFT and DFST fault tolerant
strategies may provide a solution by substituting alternative services at runtime.
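The fault tolerance strategies referred to above are described in Sect. 3.1; the combination of retrying the primary service (for short-term network glitches) and substituting alternative services at runtime (for provider-side failures) might be sketched as follows, with the callables and the retry policy being illustrative assumptions rather than the actual RFT/RBFT/DFST implementations:

```python
import time

def invoke_with_fault_tolerance(primary, alternates, retries=3, delay=0.0):
    """Retry the primary service a fixed number of times, then fall back to
    alternative services. `primary` and the entries of `alternates` are
    callables that return a result or raise an exception on failure."""
    for _ in range(retries):
        try:
            return primary()        # retry strategy: re-invoke the primary
        except Exception:
            if delay:
                time.sleep(delay)   # back off before the next attempt
    for alt in alternates:          # substitution strategy: try alternatives
        try:
            return alt()
        except Exception:
            continue
    raise RuntimeError("all payment services failed")
```

A short outage that clears within the retry window is absorbed by the first loop; a persistent provider-side failure falls through to the substitution loop, mirroring the distinction drawn in the text.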
H.T. Tran and G. Feuerlicht
Table 1. QoS data for eWay and PayPal payment services
[Table content not recoverable from the extracted text; columns included the number of fails and the response time (s) for each monitoring location.]
In order to differentiate between network connectivity and cloud service provider issues
we analyse the level of dependence between QoS parameters for the two payment
services (eWay and PayPal) at each location by calculating the correlation coefficient
for response time and availability. A high level of dependence indicates that both payment services fail or suffer from increased response time at the same time, identifying
network connectivity as the main source of the problem. Low levels of correlation
indicate independent modes of failure for the two payment services, pointing to the
service provider as the cause of QoS fluctuations.
Table 2 shows the values of the correlation coefficients of eWay and PayPal payment services calculated for different locations. The correlation coefficient $C_{(T_e,T_p)}$ [18] of the response time between eWay and PayPal services is calculated as:

$$C_{(T_e,T_p)} = \frac{\sum (T_e - \overline{T}_e)(T_p - \overline{T}_p)}{\sqrt{\sum (T_e - \overline{T}_e)^2 \, \sum (T_p - \overline{T}_p)^2}} \tag{6}$$
Optimization of Cloud-Based Applications
Table 2. Response time and availability correlation coefficients for eWay and PayPal
[Table content not recoverable from the extracted text; it lists response time and availability correlation coefficients by region (Asia Pacific, Europe, South America, US) and location.]
where $T_e$ is the response time of an eWay transaction, $T_p$ is the response time of a concurrent PayPal transaction, $\overline{T}_e$ is the average response time of the eWay service and $\overline{T}_p$ is the average response time of the PayPal service. The correlation coefficient $C_{(A_e,A_p)}$ for the availability of eWay and PayPal is calculated as:

$$C_{(A_e,A_p)} = \frac{\sum (A_e - \overline{A}_e)(A_p - \overline{A}_p)}{\sqrt{\sum (A_e - \overline{A}_e)^2 \, \sum (A_p - \overline{A}_p)^2}} \tag{7}$$
where $A_e$ is the availability of the eWay service and $A_p$ is the availability of the PayPal service, each computed over a one-hour interval, $\overline{A}_e$ is the average availability of the eWay service during the monitoring period, and $\overline{A}_p$ is the average availability of the PayPal service during the monitoring period.
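Equations (6) and (7) are the standard Pearson correlation coefficient (the CORREL function cited in [18]). A self-contained sketch of the computation:

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples,
    e.g. concurrent hourly eWay and PayPal response times (Eq. 6)
    or hourly availabilities (Eq. 7)."""
    n = len(xs)
    mx = sum(xs) / n                 # sample mean of xs
    my = sum(ys) / n                 # sample mean of ys
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Values near +1 or -1 indicate strong dependence between the two services (pointing to the shared network as the cause of QoS variation), while values near 0 indicate independent failure modes (pointing to the individual providers).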
It is evident from the low correlation coefficient values in Table 2 that the underlying factors affecting response time and availability of the two payment services are
mutually independent over the monitored period. As the two payment services share
the same network connections, this indicates that the source of QoS variability is the
service provider system, rather than the network. This implies that improved QoS
values may be achievable by deploying RBFT and DFST service substitution fault
tolerant strategies [17]. We also note that in environments characterized by reliable
low latency network connectivity the QoS values observed at the service consumer site
will approximate those published by the service provider.
Figures 3 and 4 show the hourly average response time and availability values for
eWay and PayPal services during the monitored period between 20th and 28th August
2016 for eleven geographic locations across the globe. Figure 3 shows that the
response time of eWay services is generally better than for PayPal and that the response
time of PayPal deployed in the US and Europe is better than that in Asia
Pacific. Figure 4 shows that the availability of both services varies from 98.8 to 99.8%
with PayPal availability slightly better than that of eWay.
Fig. 3. Hourly average response times of eWay and PayPal services (20th to 28th August 2016)
Fig. 4. Hourly average availability of eWay and PayPal services (20th to 28th August 2016)
5 Conclusion
In this paper we have argued that consumer-side monitoring of QoS characteristics of
cloud services is essential to enable service consumers to make informed decisions
about service selection at design-time, and to maintain good run-time performance of
cloud-based enterprise applications. Service consumers need to supplement QoS
information published by cloud providers with data obtained independently using
consumer side monitoring taking into account location-based information, as the QoS
values measured at the consumer deployment site (i.e. at the site where the enterprise
application is running) may vary from those published by cloud service providers.
Our results obtained using AWS (Amazon Web Services) platforms deployed in
eleven sites across four geographic regions to monitor eWay and PayPal payment
services indicate that both services achieved availability values above 99.9% during
most of the measurement period 20th to 28th August 2016. It is evident from the low
correlation coefficient values that the underlying factors affecting response time and
availability of the two payment services are mutually independent. As the two payment
services share the same network connections, this indicates that the source of QoS
variability is the service provider system, rather than the network. This implies that
improved QoS values may be achievable by deploying RBFT and DFST service
substitution fault tolerant strategies. Using a combination of QoS information published
by cloud service providers and QoS data measured at different geographic locations by
service consumers improves the understanding of performance and reliability
trade-offs and can facilitate the selection of more effective optimization strategies.
In our future work we plan to collect QoS data over an extended period of time to
give more reliable estimates of service availability and response time. We also plan to
make our monitoring database publicly available to cloud service consumers to enable
sharing of QoS information and to promote a collaborative effort aimed at improving
improve the accessibility of cloud QoS information.
References
1. Wenmin, L., Wanchun, D., Xiangfeng, L., Chen, J.: A history record-based service
optimization method for QoS-aware service composition. In: 2011 IEEE International
Conference on Web Services (ICWS) (2011)
2. Safy, F.Z., El-Ramly, M., Salah, A.: Runtime monitoring of SOA applications: importance,
implementations and challenges. In: 2013 IEEE 7th International Symposium on Service
Oriented System Engineering (SOSE) (2013)
3. Aceto, G., Botta, A., De Donato, W., Pescapè, A.: Cloud monitoring: a survey. Comput.
Netw. 57(9), 2093–2115 (2013)
4. Lu, W., Hu, X., Wang, S., Li, X.: A multi-criteria QoS-aware trust service composition
algorithm in cloud computing environments. Int. J. Grid Distrib. Comput. 7(1), 77–88 (2014)
5. Noor, T.H., Sheng, Q.Z., Ngu, A.H., Dustdar, S.: Analysis of web-scale cloud services.
IEEE Internet Comput. 18(4), 55–61 (2014)
6. Tran, H.T., Feuerlicht, G.: Improving reliability of cloud-based applications. In: Aiello, M.,
Johnsen, E.B., Dustdar, S., Georgievski, I. (eds.) ESOCC 2016. LNCS, vol. 9846, pp. 235–
247. Springer, Heidelberg (2016). doi:10.1007/978-3-319-44482-6_15
7. Tsai, W.T., Zhou, X., Chen, Y., Bai, X.: On testing and evaluating service-oriented software.
Computer 41(8), 40–46 (2008)
8. Zibin, Z., Lyu, M.R.: A distributed replication strategy evaluation and selection framework
for fault tolerant web services. In: IEEE International Conference on Web Services, ICWS
2008 (2008)
9. Adams, M., Bearly, S., Bills, D., Foy, S., Li, M., Rains, T., Ray, M., Rogers, D., Simorjay,
F., Suthers, S., Wescott, J.: An introduction to designing reliable cloud services. Microsoft
Trustworthy Computing (2014). https://www.microsoft.com/en-au/download/details.aspx?
10. Zibin, Z., Zhou, T.C., Lyu, M.R., King, I.: Component ranking for fault-tolerant cloud
applications. IEEE Trans. Serv. Comput. 5(4), 540–550 (2012)
11. Reddy, C.M., Nalini, N.: FT2R2Cloud: Fault tolerance using time-out and retransmission of
requests for cloud applications. In: 2014 International Conference on Advances in
Electronics, Computers and Communications (ICAECC) (2014)
12. Deng, X., Xing, C.: A QoS-oriented optimization model for web service group. In: 8th
IEEE/ACIS International Conference on Computer and Information Science, ICIS 2009.
IEEE (2009)
13. Leitner, P., Hummer, W., Dustdar, S.: Cost-based optimization of service compositions.
IEEE Trans. Serv. Comput. 6(2), 239–251 (2013)
14. Lee, C., Lehoezky, J., Rajkumar, R., Siewiorek, D.: On quality of service optimization with
discrete QoS options. In: Proceedings of 5th IEEE Real-Time Technology and Applications
Symposium. IEEE (1999)
15. Fallon, L., O’Sullivan, D.: The AESOP approach for semantic-based end-user service
optimization. IEEE Trans. Netw. Serv. Manag. 11(2), 220–234 (2014)
16. Zheng, Z., Lyu, M.R.: Selecting an optimal fault tolerance strategy for reliable
service-oriented systems with local and global constraints. IEEE Trans. Comput. 64(1),
219–232 (2015)
17. Feuerlicht, G., Tran, H.T.: Service consumer framework: managing service evolution from a
consumer perspective. In: 16th International Conference on Enterprise Information Systems,
ICEIS-2014. Springer, Portugal (2014)
18. Microsoft: CORREL function (2016). https://support.office.com/en-us/article/CORREL-function-995dcef7-0c0a-4bed-a3fb-239d7b68ca92. Cited 22 Aug 2016
Author Index
Badecka, Inez 309
Basl, Josef 156
Bernroider, Edward W.N.
Cao, Tuan-Dung 32
Carta, Salvatore 263
Chen, Hong 103
Chen, Yong 103
Chlapek, Dušan 48
Decker, Reinhold 145
Dedić, Nedim 225
Doucek, Petr 77, 253
Fernandes, Ana 191
Feuerlicht, George 325
Figueiredo, Margarida 191
He, Wu 103
Hitz, Michael 16
Kessel, Thomas 16
Klat, Wilhelm 145
Kopčová, Veronika 112
Korczak, Jerzy 88
Kučera, Jan 48
Laar, David Sanka 207
Li, Ling 103
López, Gustavo 277
Maia, Nuno 191
Margiol, Sebastian 127
Marín-Raventós, Gabriela
Marreiros, Goreti 191
Martinho, Bruno 237
Maryska, Milos 253
Mladenow, Andreas 166
Mockus, Martynas 59
Nedomova, Lea 253
Neves, José 191
Neves, Mariana 191
Nguyen, Quang-Minh 32
Nguyen, Thanh-Tam 32
Nikander, Jussi 177
Novak, Niina Maarit 166
Nqampoyi, Vuvu 207
Pacheco, Alexia 277
Pavlíček, Antonín 77
Pawełoszek, Ilona 88
Saia, Roberto 263
Santos, Maribel Yasmina 237
Seymour, Lisa F. 207, 293
Skulimowski, Andrzej M.J. 309
Sokol, Pavol 112
Sookoo, Kavish 293
Stanier, Clare 225
Strauss, Christine 166
Stummer, Christian 145
Szabó, Ildikó 3
Taudes, Alfred 127
Ternai, Katalin 3
Tran, Hong Thai 325
Van Belle, Jean-Paul 293
Vicente, Henrique 191
Xu, Li