Performance Analysis of Financial Institution Operations in a NoSQL Columnar Database

Felipe de Moura Rezende dos Santos
University of Brasilia - UnB,
Brasilia, Federal District - Brazil
feliperezende86@gmail.com

Maristela Holanda
University of Brasilia - UnB,
Brasilia, Federal District - Brazil
mholanda@unb.br
Abstract — The advent of Big Data has brought the need to analyse large volumes of data. The Hadoop ecosystem has been used to store and process this Big Data. Hadoop is composed of different frameworks, such as: HBase, a NoSQL column-oriented database; Hive, a framework for building data warehouses on distributed storage with SQL, used in Business Intelligence; Sqoop, for transferring data between Apache Hadoop and relational database structures; and others. The objective of this paper is to analyse the use of the Hadoop ecosystem in a financial system to reduce the processing time that its transactions require. Results show that the Hadoop ecosystem improves the processing time in that financial database.
Keywords - Big Data, HBase, Hadoop, NoSQL Columnar, Hive.
I. INTRODUCTION
An accounting system contains the records of the financial and economic events of an organization. Its purpose is to organize and summarize information so that it can be consulted at any time, along with the economic profile of a given period [1]. Management accounting is a process of identifying, measuring, analysing and communicating financial information used by management to plan and control a company and to ensure the appropriate use of its resources. In addition to aiding the company's decision making, managerial accounting serves as the basis for control [2]. Management accounting thus provides the information necessary for the management and development of a company.
The handling of large amounts of data, called Big Data, is an increasingly relevant issue. The efficiency of storage tools for the processing or mining of such data [3] has been the subject of much research. One of these tools is the NoSQL columnar database [4] [5] [6] [7], which performs well for data warehousing, allowing results to be visualized at various levels of granularity and dimension. The columnar database partitions the table vertically and stores each column separately, presenting an additional benefit: better type-based compression [8].
The general objective of this work is to generate
management accounting reports for the financial institution
Beta, using a NoSQL columnar database with frequency greater
than the current one. Furthermore, a NoSQL Columnar modelling is proposed for storage of the values of the
managerial results of the Financial Institution Beta. Finally, the
performance of the process of calculation of managerial results
is analysed.
This paper is divided as follows: background in Section II;
related works are presented in Section III; Section IV describes
the implementation of the problem solution proposal; the results
obtained are described in Section V; Section VI contains the
conclusion and future work.
II. BACKGROUND
A. Not only SQL (NoSQL) Database
Big data is a combination of structured, semi-structured and/or unstructured data on a large scale [9]. NoSQL does not replace relational databases; instead, the two complement each other.
Several studies, such as [10] [11] [7], list the 4 Vs as one of the main characteristics of Big Data: Volume, Veracity, Velocity and Variety. References [12] and [13] present a fifth dimension for Big Data analysis: Value. Data analysis that does not generate value offers no contribution to the organization. Ma-Lin Song et al. [14] also take the 4 Vs (Volume, Veracity, Velocity, and Variety) as Big Data characteristics, but add Valorization, describing how the 5 Vs add significantly to the complexity of solving relevant problems.
Nayak et al. [15] describe how NoSQL can be categorized into 4 types:
• Document Store Database: Document databases use a document-oriented model to store data. They store a record and its associated data in a single structure called a document. Each document contains several associated attributes and values [16]. Documents can be retrieved based on attribute values using various application programming interfaces (APIs) or query languages provided by the DBMS (Database Management System).
• Graph Database: Graph databases are database management systems that have been optimized to store, query, and update graph structures. Graphs consist of nodes and edges, where nodes act as objects and edges act as relationships between objects [17].
• Key-value Store Database: The databases belonging to this group are essentially distributed hash tables that provide at least two operations: get(key) and put(key, value). A key-value database maps data items to a key space that is used to allocate key/value pairs to computers and to efficiently locate a value given its key. These databases are designed to scale to terabytes or even petabytes, handling millions of concurrent operations by adding computers horizontally.
• Wide Column Store Database: Advances in information technology have led to an increasing amount of available data, creating demand for its analysis using algorithms from a variety of methodologies, e.g. statistics, clustering and forecasting. Such analysis is typically performed in the database system, as it provides optimized execution and efficient scheduling over the data volume used [18].
Wide Column, Family Column, or Columnar databases store data by columns and do not impose a rigid schema on user data. This means that some rows may or may not have columns of a certain type. The columnar database has a graphical representation similar to the relational database; the main difference between them concerns null values. In a use case with different attribute types, a relational database must store each null value in its respective column, whereas a columnar database saves the key-value of a field only if it is really needed [20].
Column-oriented databases use tables as a data model, but do not support table association. Column-oriented databases have the following characteristics [16]:
• the data is stored per column, i.e. the data is stored separately for each column;
• each data column is an index of the database; only the columns involved in the result of a query are accessed, to reduce the I/O of the system;
• processing is simultaneous, i.e. each column is processed by a process;
• the data in each column have the same type and similar characteristics, giving a good compression ratio.
B. HBase
Big Data has become very important in many organizations, whether public or private [19]. However, unusual data volumes become a problem when faced with the limited capabilities of traditional systems, especially when data storage is in a distributed environment that requires the use of parallel processing, such as the MapReduce paradigm.
HBase, an Apache open source project, is a fault-tolerant, highly scalable, column-oriented, distributed NoSQL database built on top of HDFS. It is used for random-access reads and writes on very large databases [21]. All access to the data is done through the primary key, and the results of an HBase table can be scanned in a MapReduce job; in [21], such parallel scans show faster query response times and better overall performance.
Similar to HDFS and MapReduce [21], HBase also adopts a master-slave architecture: the HMaster (master) is responsible for assigning regions to HRegionServers (slaves), and an HRegionServer is responsible for managing client requests for reading and writing. HBase uses Zookeeper to manage the HBase cluster. The SQL query language is not supported in HBase; however, there is a Hive/HBase integration project that allows SQL statements written in Hive to access HBase tables. There is also the HBql project, which adds a dialect of SQL and JDBC connections to HBase.
The row key is used to identify the column values belonging to the same tuple. The column name identifies the attribute of a value and can be composed of the column family name and the column name; in other words, a column can be complex or simple. If the column name is prefixed, it consists of the column family name (prefix) and the name of the nested column; in this case it is called a composite attribute (it belongs to a column family), otherwise it is considered a simple attribute. Finally, the timestamp allows the coherence of the data to be checked: each value receives a system time and date record for the purpose of data consistency.
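To make this data model concrete, the session below sketches how a value is addressed by row key, column family, column qualifier and timestamp in the HBase shell. It is a minimal, hypothetical example: the table operations and the column family op are placeholder names, not the schema used by the institution.

#!/bin/bash
# Hypothetical HBase shell session illustrating row key,
# family:qualifier addressing and per-cell timestamps.
hbase shell <<'EOF'
create 'operations', {NAME => 'op', VERSIONS => 3}
put 'operations', 'oper-0001', 'op:balance', '1500.00'
put 'operations', 'oper-0001', 'op:status', 'normal'
get 'operations', 'oper-0001'
get 'operations', 'oper-0001', {COLUMN => 'op:balance', VERSIONS => 3}
EOF

Each put writes one cell under a family:qualifier pair, and each cell keeps a timestamp, so the last get can inspect older versions of the same value when the family retains more than one version.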
C. Sqoop
Sqoop is a method of loading data into HBase, also present in the Hadoop ecosystem [22]. Sqoop was designed for efficient large-volume data transfer between Apache Hadoop and relational database structures [23]. It quickly copies data from external systems, including enterprise data warehouses, into Hadoop, and ensures fast performance by parallelizing the data transfer. Sqoop supports data analysis efficiently and even mitigates the excessive load otherwise imposed on external systems.
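As a minimal sketch of such a transfer (the JDBC string, credentials, table name and target directory below are hypothetical placeholders, not any real configuration), a single Sqoop command copies a relational table into Hadoop, split among parallel map tasks:

#!/bin/bash
# Hypothetical bulk copy of one relational table into HDFS.
# Sqoop splits the rows among 4 parallel map tasks (-m 4).
sqoop import \
  --connect jdbc:mysql://dbhost/finance \
  --username loader -P \
  --table BALANCES \
  --target-dir /data/finance/balances \
  -m 4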
III. RELATED WORK
Ameya Nayak et al. [15] mention a scalability problem with the relational data model: its performance degrades rapidly as data volumes increase. Since 2000, several studies [24] [25] [26] have demonstrated the shift of these databases to a new structure with a differential for Big Data processing. This new structure, known as NoSQL (Not Only SQL), has horizontal scalability. This means that, to add capacity, a database administrator can simply add more commodity servers or cloud instances; the database automatically spreads the data across the servers as needed [27].
Big Data is a powerful and highly demanded technology nowadays. Big Data represents rapidly growing data whose variety, generation speed, and volume of information cannot be handled by existing analytical applications and data warehouse systems [28].
Rashid Zafar et al. [29] compare two database management structures, describing the existing NoSQL models as well as example DBMSs. As a conclusion, they give the example of Netflix's gain from migrating its Oracle system to Cassandra: after the conversion, the company had a write capacity of more than ten thousand rows per second in its database. NoSQL systems are basically used for applications requiring high-performance data reliability, and run on multiple nodes connected in a cluster.
A comparison between the two types of DBMS is found in [30], involving four fundamental operations of a database: Instantiation, Reading, Writing and Deletion. While the NoSQL DBMSs generally had optimized results, the SQL DBMS did not; however, not all of the NoSQL databases performed equally. Therefore, one must select the correct type of NoSQL to optimize storage and information processing.
IV. PROJECT IMPLEMENTATION
One of the major problems with the current way in which the Financial Institution's management results are processed, in addition to the processing time, is the need to wait for the end of the reference month to have the realized value of the institution's results of operations. The main advantage of changing to a system that supports fast processing of large data volumes is the ability to process this information on a daily basis: the previous day's result values become available with only a one-day lag, with no need to wait for the month to close.
As found in the literature, for example in [6] and [31], one of the best options for processing management result calculations over large data is storage in a NoSQL columnar database. Given that the Beta Financial Institution has a contract with Apache and thus has the Hadoop Distributed File System (HDFS) available for use, the Hadoop ecosystem and the columnar NoSQL HBase are precisely the structures used in this work.
The proposed solution to the problem raised in Section I is the implementation of data storage with the values necessary for the calculation of the management results in columnar NoSQL, followed by the processing of this data and the calculation of the management results, with the results obtained stored in the Big Data environment.
Each day, each system generates approximately 800 MB of data. As there are 26 systems, approximately 20.8 GB of data must be processed and stored daily for this project. For this work, all operations of all systems of the financial institution Beta were used (by policy of the financial institution, the names of the systems cannot be described in academic works external to the institution). The data is stored in three distinct tables, each with a distinct group of information; to match the attributes used in today's process, only the specific attributes of each table were captured. That is, thirteen attributes from the first table, twenty-seven attributes from the second table and twenty-eight attributes from the last table.
Figure 1. Proposed macro data load process in NoSQL.
Figure 1 demonstrates the proposed processing as a new format for performing the managerial calculation of the institution's operations. The search for information in the Database of Operations and External Sources would continue in the same way, with the difference that there would be no prior calculation of these management result values. Instead, the data would be entered into the columnar NoSQL database environment for information processing.
All operations go through a process of grouping information, as components for the same operation are generated in both databases.
The calculation of management results is performed in the Big Data environment using Pig scripts and Python calculation methodologies; the results are stored in Hive for disclosure in specific panels.
One of the differences in this new processing is the absence of management result calculations in three distinct phases, as in the previous process. In this new process, the raw data are loaded into the Hadoop environment and stored in HBase; the managerial results of the operations are then calculated, still in the Big Data environment; and the results are stored in a Data Warehouse so that they can be displayed on dashboards.
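Once the results reach the Hive data warehouse, the dashboards can be fed with plain SQL. The query below is only an illustrative sketch; the table and column names (results, ref_date, manag_result) are hypothetical, not the institution's schema.

#!/bin/bash
# Hypothetical aggregation over the managerial results stored in Hive.
hive -e "
  SELECT ref_date, SUM(manag_result) AS total_result
  FROM results
  GROUP BY ref_date
  ORDER BY ref_date;
"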
Figure 2. Proposed solution to the problem.
Figure 2 presents the abstract architecture for the proposal.
The proposal is divided into three main modules:
• Data transfer: To perform information processing in the Big Data environment, the data must first be loaded into the Hadoop environment. This step is performed with Sqoop, the tool already described above, and a Shell Script file was created for each processing [32]. An incremental Sqoop process is performed daily, in which all operations whose status changed are updated (if the operation was liquidated, left the normal condition, or was contracted on that day). An update of the balances used for calculating management results is also performed daily, overwriting the existing values (a sketch of such a load is shown after this list).
• Processing: With the data stored in the Hadoop environment, the management result calculations can be processed. For this procedure, the Pig Latin structure already described in this work is used, through a file that is executed daily after the Sqoop processes with the increment of updated values. Some calculation methodologies cannot be expressed in Pig Latin, so two Python programs [32] were created for these executions.
• Storage: After all the processing of the information described in the previous item, the management result data of the financial institution's credit operations are stored in a Hive environment, well suited to Data Warehouse processes, as described in several studies already referenced in this paper.
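A sketch of the daily incremental load mentioned in the Data transfer module is shown below. It is an assumption-laden example, not the institution's actual script: the connection string, table, check column and HBase names are hypothetical, and the bookkeeping of the last imported value is simplified to a local file.

#!/bin/bash
# Hypothetical daily incremental Sqoop load into HBase.
# Rows whose STATUS_DATE changed since the last run are re-imported;
# HBase puts overwrite existing cells for the same row key, matching
# the "overwriting the existing values" behaviour described above.
LAST=$(cat /var/lib/sqoop/operations.last 2>/dev/null || echo '2019-07-01 00:00:00')
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/FINDB \
  --username loader --password-file /user/loader/.pwd \
  --table OPERATIONS \
  --hbase-table operations \
  --column-family op \
  --hbase-row-key OPER_ID \
  --incremental lastmodified \
  --check-column STATUS_DATE \
  --last-value "$LAST" \
  -m 4
date '+%Y-%m-%d %H:%M:%S' > /var/lib/sqoop/operations.last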
V. RESULTS
During the month of July 2019, the processes for capturing the data needed for the management calculation from the relational database were executed. These data were then processed by Hadoop-HBase using Pig, and finally the managerial results were loaded into a Hive database. The processes were executed at different times of the day to verify the processing characteristics of the Hadoop ecosystem configured at the institution. The execution times were chosen at random so that any change in performance according to the time of day could be analysed.
The first tests carried out were related to the data transfer process. For these tests, a shell script was created describing the three Sqoop processes that send data from the relational environment to Hadoop-HBase. The first two processes concern the operational information tables: an operation is updated only when its status changed on the previous business day, since the status identifies the operations still in force at the institution. The third process, referring to the balance table, performed the daily insertion of the data available in the relational database, overwriting the values previously stored in HBase. The full load of the three relational tables into HBase using Sqoop required a total time of 1 hour, 53 minutes and 46 seconds.
As shown in Figure 2, after the data transfers are carried out and the data is stored in HBase, the command for processing the management result is issued, using Apache Pig. The total processing time of the managerial result values is calculated as the difference between the moment the process ends and the moment it starts. This Pig Latin process reads the data stored in HBase, processes it with some methodologies applied in Python to determine the managerial results, and loads the values into the Hive database. For the purpose of the final performance analysis, we used the median time, which for this process was 34 minutes and 13 seconds.
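For illustration, the sketch below shows the shape such a Pig stage could take: reading from HBase, applying a calculation, and storing into a Hive table through HCatalog. The table, column family, field names and the trivial formula are hypothetical stand-ins for the institution's actual methodologies; the Python programs mentioned above are omitted, and the Hive table results is assumed to already exist.

#!/bin/bash
# Hypothetical Pig job: read operations from HBase, derive a managerial
# result per operation, and store it in a Hive table via HCatalog.
cat > manag_result.pig <<'EOF'
ops = LOAD 'hbase://operations'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'op:balance op:rate', '-loadKey true')
      AS (oper_id:chararray, balance:double, rate:double);
-- stand-in for the real calculation methodologies
res = FOREACH ops GENERATE oper_id, balance * rate AS manag_result;
STORE res INTO 'results' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF
pig -useHCatalog -f manag_result.pig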
All the processes were executed with manual commands; they were not scheduled to run during the quieter periods of the available servers.
Figure 3 demonstrates the time result for processing the daily information, as an average over the days processed. The time required to perform all the processing and loading of the new data was 16 minutes. To finalize the process, the data obtained is loaded into Hive, to be used as a data warehouse for better disclosure of the values obtained. The average total number of operations with results was 35.5 million.
Figure 3. Processing Result.
VI. CONCLUSION AND FUTURE WORKS
We can conclude from this study that processing this massive volume of managerial information on the operations of the financial institution represents a great improvement in the time necessary for final disclosure. This study is still at an initial stage: it does not yet generate all of the components of the managerial results that are required for it to be used as the generator of the tracking numbers of the institution's operations. However, it shows that it would be possible to continue with the insertion of these remaining values in order to better monitor the managerial results.
Figure 4. Processing Result.
As future work, it would be possible to incorporate all of
the necessary components for the managerial results of
operations of the financial institution. In addition, new
processes could be developed for the creation of specific
results required by the group of analysts who will follow the
presented values.
REFERENCES
[1] Salazar, José Nicolás Albuja, and Gideon Carvalho de Benedicto. Contabilidade financeira. Cengage Learning Editores, 2004.
[2] Nikolay, Rafael, and Luiz Fernando Costa Neves. "Contabilidade gerencial como base à controladoria." Revista Eletrônica do Curso de Ciências Contábeis 5.9 (2016): 55-80.
[3] Zhang, J., Yao, X., Han, G., & Gui, Y. (2015, October). A survey of recent technologies and challenges in big data utilizations. In 2015 International Conference on Information and Communication Technology Convergence (ICTC) (pp. 497-499). IEEE.
[4] Han, J., Haihong, E., Le, G., & Du, J. (2011, October). Survey on NoSQL database. In 2011 6th International Conference on Pervasive Computing and Applications (pp. 363-366). IEEE.
[5] Dehdouh, Khaled, Omar Boussaid, and Fadila Bentayeb. "Columnar NoSQL star schema benchmark." International Conference on Model and Data Engineering. Springer, Cham, 2014.
[6] Su, F., Wang, Z., Yang, S., Li, K., Lu, X., Wu, Y., & Peng, Y. (2017, April). A survey on big data analytics technologies. In International Conference on 5G for Future Wireless Networks (pp. 359-370). Springer, Cham.
[7] Siddiqa, Aisha, Ahmad Karim, and Abdullah Gani. "Big data storage technologies: a survey." Frontiers of Information Technology & Electronic Engineering 18.8 (2017): 1040-1070.
[8] Sridhar, K. T. "Modern column stores for big data processing." International Conference on Big Data Analytics. Springer, Cham, 2017.
[9] Bathla, Gourav, Rinkle Rani, and Himanshu Aggarwal. "Comparative study of NoSQL databases for big data storage." International Journal of Engineering & Technology 7.26 (2018): 83.
[10] Dehdouh, Khaled. "Building OLAP cubes from columnar NoSQL data warehouses." International Conference on Model and Data Engineering. Springer, Cham, 2016.
[11] Oussous, Ahmed, et al. "Big Data technologies: A survey." Journal of King Saud University-Computer and Information Sciences 30.4 (2018): 431-448.
[12] Chiang, Roger H. L., et al. "Strategic value of big data and business analytics." (2018): 383-387.
[13] Grover, V., Chiang, R. H., Liang, T. P., & Zhang, D. (2018). Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems, 35(2), 388-423.
[14] Song, Ma-Lin, et al. "Environmental performance evaluation with big data: Theories and methods." Annals of Operations Research 270.1-2 (2018): 459-472.
[15] Nayak, Ameya, Anil Poriya, and Dikshay Poojary. "Type of NoSQL databases and its comparison with relational databases." International Journal of Applied Information Systems 5.4 (2013): 16-19.
[16] Gessert, Felix, et al. "NoSQL database systems: a survey and decision guidance." Computer Science-Research and Development 32.3-4 (2017): 353-365.
[17] Amghar, Souad, Safae Cherdal, and Salma Mouline. "Which NoSQL database for IoT Applications?" 2018 International Conference on Selected Topics in Mobile and Wireless Networking (MoWNeT). IEEE, 2018.
[18] Mattis, T., Henning, J., Rein, P., Hirschfeld, R., & Appeltauer, M. (2015, October). Columnar objects: Improving the performance of analytical applications. In 2015 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!) (pp. 197-210).
[19] Dehdouh, Khaled. "Building OLAP cubes from columnar NoSQL data warehouses." International Conference on Model and Data Engineering. Springer, Cham, 2016.
[20] Hecht, Robin, and Stefan Jablonski. "NoSQL evaluation: A use case oriented survey." 2011 International Conference on Cloud and Service Computing. IEEE, 2011.
[21] Vora, Mehul Nalin. "Hadoop-HBase for large-scale data." Proceedings of 2011 International Conference on Computer Science and Network Technology. Vol. 1. IEEE, 2011.
[22] Vohra, Deepak. "Using Apache Sqoop." Pro Docker. Apress, Berkeley, CA, 2016. 151-183.
[23] Aravinth, S. S., Begam, A. H., Shanmugapriyaa, S., Sowmya, S., & Arun, E. (2015). An efficient HADOOP frameworks SQOOP and ambari for big data processing. International Journal for Innovative Research in Science and Technology, 1(10), 252-255.
[24] Sareen, Pankaj, and Parveen Kumar. "NoSQL database and its comparison with SQL database." International Journal of Computer Science & Communication Networks 5.5 (2015): 293-298.
[25] Dai, Jiao. "SQL to NoSQL: What to do and How." IOP Conference Series: Earth and Environmental Science. Vol. 234. No. 1. IOP Publishing, 2019.
[26] Silva, Yasin N., Isadora Almeida, and Michell Queiroz. "SQL: From traditional databases to big data." Proceedings of the 47th ACM Technical Symposium on Computing Science Education. 2016.
[27] Sareen, Pankaj, and Parveen Kumar. "NoSQL database and its comparison with SQL database." International Journal of Computer Science & Communication Networks 5.5 (2015): 293-298.
[28] Solanke, Ganesh B., and K. Rajeswari. "SQL to NoSQL transformation
system using data adapter and analytics." 2017 IEEE International
Conference on Technological Innovations in Communication, Control and
Automation (TICCA). IEEE, 2017.
[29] Zafar, R., Yafi, E., Zuhairi, M. F., & Dao, H. (2016, May). Big data: the
NoSQL and RDBMS review. In 2016 International Conference on
Information and Communication Technology (ICICTM) (pp. 120-126).
IEEE.
[30] Li, Yishan, and Sathiamoorthy Manoharan. "A performance comparison of
SQL and NoSQL databases." 2013 IEEE Pacific Rim Conference on
Communications, Computers and Signal Processing (PACRIM). IEEE,
2013.
[31] Sridhar, K. T. "Modern column stores for big data processing." International
Conference on Big Data Analytics. Springer, Cham, 2017.
[32] https://github.com/feliperezende86/PPCA.git