Data Mining in Finance
Sahil Kadam, Manan Raval

Undergraduate Student, Electronics And Telecommunications Engineering Department,
Dwarkadas J. Sanghvi College Of Engineering, Mumbai, India.
ABSTRACT: This paper surveys the growing field of data mining
and its applications in finance. Data mining techniques are a
great aid in financial accounting and fraud detection because of
their classification and prediction abilities. The aim of this
paper is to show the various data mining techniques used for
Going Concern and Financial Distress, Fraud Management,
Bankruptcy Prediction, Credit Risk Estimation and Corporate
Performance Prediction. The paper provides the relevant
background knowledge, presents the various data mining
techniques and their computer implementation, and applies them
to current problems in finance, including the building and
evaluation of trading models, fraud management, bankruptcy
prediction and the management of risk.
Keywords: Data Mining, Finance, Risk Management, Fraud Management
Credit assessment and money laundering analyses are among the
main financial tasks for data mining.
The naive approach to data mining in finance assumes that
somebody can provide a cookbook instruction on “how to
achieve the best result”. Certain academic investigators
continue to encourage this unjustified belief. In fact, the only
realistic approach proven to be successful is to compare
different methods conceptually, showing their strengths and
weaknesses relative to problem characteristics (problem ID),
and to leave to the user the selection of the method that most
likely fits the specific circumstances of the user's problem. Our
work tries to illustrate the techniques used and the application
areas to which particular ones are suited.
The term "data mining" refers to new methods for the
intelligent analysis of large sets of data. These
methods have emerged from numerous historically separate
fields, such as information systems, machine learning,
artificial intelligence, data engineering and knowledge
discovery. One of the most attractive application areas of these
emerging technologies is finance, which is becoming more
amenable to data-driven modelling as large sets of financial
data become available. Data mining, also known as knowledge
discovery in data, is the process of studying and
analysing data from different sources and evaluating and
combining it into useful information that can be used to
increase income and profits, cut costs, or both. In recent
years, data mining has become an extremely important
component in the life of businesses and government.
Finance, on the other hand, is a broad term that describes two
related activities: the study of how money is managed and the
actual process of acquiring needed funds. Businesses,
individuals and government entities all need
funding to operate; hence, the field is often separated into
three sub-categories: personal finance, corporate finance and
public finance. Forecasting stock markets, bank insolvencies and
currency exchange rates, managing and understanding financial
risk, bank customer profiling, trading futures, loan
The different methods of data mining include a large number
of algorithms, techniques and models derived from the
gradual convergence of databases, machine learning
techniques, statistics and visualization. Most
of these methods are used to study complex financial data. The
data mining methods studied in this paper
are Genetic Algorithms, Neural Networks, Rough Set Theory,
Decision Trees, Mathematical Programming and Case-Based
Reasoning.
2.1 Neural networks in data mining
In more practical terms, neural networks (NNs) are non-linear
statistical data modelling tools. They can be used to model
complex relationships between inputs and outputs or to find
patterns in data. Using neural networks as a tool, data
warehousing firms harvest information from datasets in
the process known as data mining. The difference between
these data warehouses and ordinary databases is that there is
actual manipulation and cross-fertilization of the data, helping
users make more informed decisions.
The neurons are organized into layers. A network which is
layered in this manner consists of at least an input (first) and
an output (last) layer. Between these two layers there may
exist one or more hidden layers. Different kinds of NNs have a
different number of layers. Self-organizing maps (SOM) have
only an input and an output layer, whereas a backpropagation
NN additionally has one or more hidden layers [1].
After the network architecture has been defined, the network
needs to be trained. In backpropagation networks a pattern is
applied to the input layer and a final output is calculated at
the output layer. This output is compared with the desired result
and the errors are propagated backwards through the NN by tuning
the weights of the connections. This process iterates until an
acceptable error rate is reached. Backpropagation NNs
have become popular for prediction and classification.
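The backpropagation training loop described above can be sketched in a few lines. This is only a minimal illustration, assuming one hidden layer, sigmoid units, a squared-error measure and a toy logical-AND data set; the function names, the learning rate and the stopping threshold are illustrative choices, not taken from the surveyed studies.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_in, n_hidden, seed=0):
    """Random weights in [-1, 1]; the last weight of each neuron is its bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Apply a pattern to the input layer and compute the final output."""
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, y


def mean_squared_error(net, data):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data) / len(data)


def train(net, data, rate=0.5, max_epochs=5000, target_error=0.01):
    hidden, output = net
    for _ in range(max_epochs):
        for x, t in data:
            h, y = forward(net, x)
            # Error signal at the output: desired result minus actual output.
            delta_out = (t - y) * y * (1.0 - y)
            # Propagate the error backwards to every hidden neuron.
            delta_hid = [delta_out * output[j] * h[j] * (1.0 - h[j])
                         for j in range(len(hidden))]
            # Tune the weights of the connections.
            for j, v in enumerate(h + [1.0]):
                output[j] += rate * delta_out * v
            for j, ws in enumerate(hidden):
                for i, v in enumerate(x + [1.0]):
                    ws[i] += rate * delta_hid[j] * v
        # Iterate until an acceptable error rate is reached.
        if mean_squared_error(net, data) <= target_error:
            break
    return net


# Toy training set (an assumption for illustration): logical AND.
AND_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
```

Calling `train(init_network(2, 3), AND_DATA)` adjusts the weights in place; comparing `mean_squared_error` before and after training shows how far the error has fallen.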
The SOM is a clustering and visualization method based on
unsupervised learning. For each input vector, only one output
neuron is activated. The winner’s weight vector is
updated to correspond more closely with the input vector. Thus,
similar inputs are mapped to the same or neighboring output
neurons, forming clusters. Two commonly used SOM
topologies are the rectangular lattice, where each neuron has
four neighbors, and the hexagonal lattice, where each neuron
has six neighbors.
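A single SOM training step on a rectangular lattice might look as follows. This is a sketch under assumptions: the neighbours of the winner are also nudged towards the input, at a smaller rate, which is the usual way the neighbourhood effect is obtained but is not spelled out in the text above; the rates and function names are illustrative.

```python
def best_matching_unit(weights, x):
    """Index of the output neuron whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k]))


def grid_neighbors(k, rows, cols):
    """Up to four neighbours of neuron k on a rectangular lattice."""
    r, c = divmod(k, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * cols + cc for rr, cc in candidates
            if 0 <= rr < rows and 0 <= cc < cols]


def som_step(weights, x, rows, cols, rate=0.5, neighbor_rate=0.25):
    """Move the winner (and, more weakly, its neighbours) towards x."""
    winner = best_matching_unit(weights, x)
    updates = [(winner, rate)]
    updates += [(n, neighbor_rate) for n in grid_neighbors(winner, rows, cols)]
    for k, r in updates:
        weights[k] = [w + r * (xi - w) for w, xi in zip(weights[k], x)]
    return winner
```

Repeating `som_step` over all input vectors, with rates that shrink over time, is what lets similar inputs settle on the same or adjacent neurons.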
It seems that NNs attract the interest of most
researchers in our area of concern. Their structure and
working principles enable them to deal with problems where
an effective algorithm-based solution is not applicable. Since
they learn from examples and generalize to new observations,
they can classify previously unseen patterns. They have the
ability to deal with incomplete, ambiguous and noisy data.
Unlike traditional statistical techniques, they make no a
priori assumptions about the distribution of the data, nor do
they assume independent input variables.
2.2 Genetic Algorithms
The Genetic Algorithm (GA) was developed by Holland in the 1970s.
It combines Darwinian evolutionary theory with sexual
reproduction. A GA is a stochastic search algorithm modelled on
the process of natural selection, which underlies biological
evolution. GAs have been successfully applied to many
optimization, search and machine learning problems. A GA
proceeds iteratively, producing new populations
of strings from old ones.
Each string is the encoded (binary, real, etc.) version of a
candidate solution. An evaluation function associates a fitness
measure with every string, indicating its fitness for the
problem. A standard GA applies genetic operators such as crossover,
selection and mutation to an initially random population
in order to compute a whole generation of new strings.
Three operators are applied to chromosomes:
1) Reproduction, where individuals multiply by
duplicating themselves with a probability proportional to their
fitness value.
2) Crossover, where two chromosomes reciprocally exchange
some bits, creating new chromosomes.
3) Mutation, which works on a single chromosome by altering
one or more bits. The probability of mutation is very low.
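The genetic operators can be illustrated on binary strings. The following is a minimal sketch, assuming a toy fitness function (the number of 1-bits), single-point crossover and an even population size; none of these choices comes from the surveyed literature.

```python
import random


def fitness(bits):
    """Toy fitness (an assumption): the number of 1-bits in the string."""
    return sum(bits)


def reproduce(population, rng):
    """Reproduction: copy individuals with probability proportional to fitness."""
    weights = [fitness(ind) + 1e-9 for ind in population]  # avoid all-zero weights
    chosen = rng.choices(population, weights=weights, k=len(population))
    return [list(ind) for ind in chosen]


def crossover(a, b, point):
    """Crossover: the two chromosomes exchange the bits after `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Mutation: flip each bit independently with a very low probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.01, seed=0):
    """Produce new populations of strings from old ones; return the fittest."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        selected = reproduce(population, rng)
        population = []
        for a, b in zip(selected[0::2], selected[1::2]):
            population.extend(crossover(a, b, rng.randint(1, length - 1)))
        population = [mutate(ind, mutation_rate, rng) for ind in population]
    return max(population, key=fitness)
```

In a financial setting the bit string would encode, for example, a subset of candidate financial ratios, and the fitness function would score a model built on that subset.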
2.3 Decision tree
A decision tree is a classification and prediction method that
sequentially partitions observations into mutually exclusive
subgroups. The method searches for the attribute that best splits
the samples into distinct classes. Subgroups are
split successively until they are too small or
no significant statistical difference exists between
candidate subsets. If the decision tree becomes too large, it is
finally pruned.
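The search for the best splitting attribute and the recursive partitioning can be sketched as below. The Gini impurity is assumed here as the split criterion (the text does not name one), the minimum subgroup size stands in for the "too small" stopping rule, and pruning is left out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subgroup)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (attribute index, threshold, weighted impurity) of the best split."""
    best = None
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[attr] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (attr, threshold, score)
    return best


def build_tree(rows, labels, min_size=2):
    """Split recursively until subgroups are pure or too small."""
    split = best_split(rows, labels) if len(rows) >= min_size else None
    if split is None or gini(labels) == 0.0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    attr, threshold, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[attr] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], min_size),
        "right": build_tree([r for r, _ in right], [y for _, y in right], min_size),
    }
```

Each row would hold the financial ratios of one firm and each label its class; the values used to exercise the sketch are made up for illustration.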
2.4 Particle Swarm Optimization
The study of biological colonies revealed
that the intelligence emerging from collective behaviour can deliver
efficient solutions to specific optimization problems.
Inspired by the social behaviour of animals, such as fish
schooling and bird flocking, Kennedy and Eberhart designed
Particle Swarm Optimization (PSO) in 1995. The basic
PSO model involves a group of particles moving in a
d-dimensional search space. The direction and distance of every
particle's movement in this space are determined by its
fitness and velocity. Generally, the fitness is
tied to the optimization goal and the velocity is
updated according to a learning rule.
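The velocity update can be made concrete with a short sketch. It assumes the widely used inertia-weight form of the rule (a later refinement of the 1995 model), a toy sphere objective to minimise, and illustrative parameter values.

```python
import random


def sphere(position):
    """Toy objective (an assumption): sum of squares, minimised at the origin."""
    return sum(x * x for x in position)


def pso(objective, dim=2, n_particles=15, iterations=60,
        inertia=0.7, c_personal=1.5, c_social=1.5, seed=0):
    """Return (best position found, objective value of the initial best)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in pos]
    global_best = min(personal_best, key=objective)[:]
    initial_value = objective(global_best)
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: keep some momentum, then pull towards the particle's
                # own best position and towards the swarm's best position.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (personal_best[i][d] - pos[i][d])
                             + c_social * r2 * (global_best[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if objective(pos[i]) < objective(personal_best[i]):
                personal_best[i] = pos[i][:]
                if objective(pos[i]) < objective(global_best):
                    global_best = pos[i][:]
    return global_best, initial_value
```

Because the swarm's best position is only ever replaced by a better one, the returned solution is never worse than the best of the initial random particles.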
Thus, artificial neural networks (ANNs) are non-linear
statistical models based on the working of the human brain.
They are powerful tools for modelling unknown relationships
in data. ANNs are able to identify the
complex patterns between input and output variables and then
predict the outcome for new, independent input data.
2.5 Rough Set Theory
Rough Set Theory (RST) was introduced by Pawlak (1982).
RST extends set theory with the notion of an element’s
possible membership in a set. Given a class C, the lower
approximation of C consists of the samples that certainly
belong to C. The upper approximation of C consists of the
samples that cannot be described as not belonging to C. RST
may be used to describe dependencies between attributes, to
evaluate the significance of attributes, to deal with inconsistent
data and to handle uncertainty (Dimitras et al. 1999) [1].
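The lower and upper approximations can be computed directly from the indiscernibility classes, that is, the groups of samples that share the same values on the chosen attributes. The sketch below uses hypothetical firm names and attribute values purely for illustration.

```python
from collections import defaultdict


def approximations(objects, attributes, target):
    """Lower and upper approximation of `target` (a set of object names).

    `objects` maps each name to a dict of attribute values; objects with equal
    values on `attributes` are indiscernible and form one equivalence class.
    """
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # every member certainly belongs to the class
            lower |= members
        if members & target:    # membership cannot be ruled out
            upper |= members
    return lower, upper


# Hypothetical data: f1 and f2 look identical, but only f1 failed.
FIRMS = {
    "f1": {"liquidity": "low", "debt": "high"},
    "f2": {"liquidity": "low", "debt": "high"},
    "f3": {"liquidity": "high", "debt": "low"},
    "f4": {"liquidity": "low", "debt": "low"},
}
FAILED = {"f1", "f4"}
```

With these values, `approximations(FIRMS, ["liquidity", "debt"], FAILED)` places only `f4` in the lower approximation, while `f1`, `f2` and `f4` form the upper approximation; the difference between the two is the region where the data are inconsistent.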
To date, data mining has become a promising solution for
identifying dynamic and nonlinear relations in financial
data. It has been applied to diverse financial areas
including stock forecasting, portfolio management and
investment risk analysis, prediction of bankruptcy and foreign
exchange rates, detection of financial fraud, loan payment
prediction, customer credit policy analysis, and so on. In this
paper, we primarily focus on the first five applications in the
above list, which have been discussed most in the literature.
3.1 Going Concern and Financial Distress
According to SAS 59, the auditor has to evaluate the ability of
his/her client to continue as a going concern (GC) for at least
one year past the balance sheet date. If there are signs that the
client corporation will face financial problems that may lead to
failure, the auditor has to issue a going concern report.
The assessment of the going concern status is not an easy
task. Studies indicate that only a relatively small share of failed
companies have been qualified on a going concern basis (Koh
2004). To assist auditors in the going concern reporting
task, statistical and machine learning methods have
been proposed.
Koh (2004) compared backpropagation NN, Decision
Tree and logistic regression methods in a going concern
prediction study. The data sample contained 165 going
concern companies and 165 matched non-going concern
companies. Six selected financial ratios were used as input
variables. The author reported that Decision Trees
outperformed the other two methods.
Tan and Dihardjo (2001) built upon a previous study by
Tan, which tried to predict financial distress for Australian
credit unions by using NNs. In his previous study Tan used
quarterly financial data and tried to predict distress on a
quarterly basis. Tan and Dihardjo improved the method by introducing
the notion of an “early detector”. When the model predicts that a
credit union will become distressed in a particular quarter and the
union actually becomes distressed in a later quarter, within a
maximum of four quarters, the quarter is labeled as an “Early Detector”.
This improved method performed better than the previous one
in terms of the Type II error rate. Thirteen financial ratios were
used as input variables and a sample of 2144 observations was used.
The results were compared with those of a Probit model and
were found to be marginally better, especially for the Type I error
rate.
Konno and Kobayashi (2000) proposed a method for
enterprise rating by using Mathematical Programming
techniques. The method made no distributional assumptions
about the data. Three alternatives, based on discrimination by
hyperplane, by quadratic surface and
by elliptic surface, were employed. Six financial
ratios derived from financial statements were used as input
variables. The data sample contained 455 enterprises. The
method calculated a score for each enterprise.
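The “early detector” labelling of Tan and Dihardjo described above can be expressed as a small rule. The sketch below is one possible reading of it: only the “early detector” label comes from the text, while the other label names, the boolean encoding of quarters and the function name are assumptions made for illustration.

```python
def label_early_detectors(predicted, actual, horizon=4):
    """Label each quarter of one credit union.

    predicted[q]: the model flags distress in quarter q.
    actual[q]:    the union is actually distressed in quarter q.
    """
    labels = []
    for q, flagged in enumerate(predicted):
        # Quarters following q, up to the maximum look-ahead horizon.
        upcoming = actual[q + 1:q + 1 + horizon]
        if flagged and actual[q]:
            labels.append("correct")
        elif flagged and any(upcoming):
            labels.append("early detector")
        elif flagged:
            labels.append("false alarm")
        elif actual[q]:
            labels.append("missed")
        else:
            labels.append("healthy")
    return labels
```

Under this reading, a flag raised up to four quarters before actual distress is no longer counted as a misclassification, which is how the relabelling can lower the reported error rate.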
3.2 Fraud Management
Management fraud is deliberate fraud committed by
managers through falsified financial statements. Management
fraud injures tax authorities, shareholders and creditors.
Spathis (2002) developed two models for identifying
falsified financial statements (FFS) from publicly available data.
The input variables for the first model comprise 9 financial ratios.
For the second model, the Z-score is added as an input variable to
capture the relationship between financial distress and
financial statement manipulation. The method used is logistic
regression and the data sample contained 38 FFS and 38 non-FFS
firms. For both models the results show that 3 variables
with significant coefficients entered the model.
3.3 Bankruptcy Prediction
Predicting bankruptcy is of great benefit to those who have
a stake in the firm concerned, for bankruptcy is the final
state of corporate failure. In the 21st century, corporate
bankruptcy in the world has reached an unprecedented level. It
results in huge economic losses to companies, stockholders,
employees, and customers, together with tremendous social
and economic cost to the nation. Therefore, accurate
prediction of bankruptcy has become an important issue in
finance. Companies are strongly demanding explanations for
the logic of prediction. They find it more acceptable to hear,
for instance, that the prediction is produced based on
computer-generated rules than to hear that the decision is
made by an advanced technique that offers no explanation.
The breakthrough bankruptcy prediction model was the
Z-score model developed by Altman. The five-variable Z-score
model, built using multiple discriminant analysis, showed very
strong predictive power. Since then, discriminant analysis
has proved to be the most widely accepted and
successful method in the bankruptcy prediction literature. In
addition, numerous studies have tried to develop different
bankruptcy prediction models by applying other data mining
techniques including logistic regression analysis, genetic
algorithms, decision trees, classification and regression trees
(CART), and other statistical methods. Those techniques can
generally provide good interpretability of the prediction
models. In the past two decades, a number of studies have also
applied the neural network approach to bankruptcy prediction,
most centering on the comparison of the predictive performance
of neural networks with that of other methodologies such as
discriminant analysis and logit analysis. Some have reported
that the performance of neural networks is slightly better than
that of other techniques, but results are contradictory or
Although neural networks and statistical models have been
used for bankruptcy prediction, they may encounter the
problem of unequal frequencies of the two states of interest,
which creates at least two major obstacles in evaluating the
network predictive performance. The first issue involves the
impact of unequal frequencies of the two states (e.g.,
bankruptcy versus not bankruptcy) on training a neural
network or estimating the parameters of statistical models.
Drawing random samples from unbalanced populations will
likely yield samples that contain an overwhelming majority of
one state of interest.
Consequently, the decision performance of neural networks or
statistical models may be poor while being tested in realistic
situations. To overcome this problem, researchers have
selected choice-based sampling technique in which the
probability of an observation entering the sample depends on
the value of the dependent variable. The second problem
involves evaluating the accuracy of various decision models.
The percentage of observations correctly classified can be
very misleading with unbalanced samples. In general, training
a neural network with balanced samples in applications such
as bankruptcy prediction can enable the network to familiarize
itself with the infrequent state of interest. Neural networks
trained on balanced samples provide the best results when
tested under realistic conditions. Jain and Nag
constructed several training samples with different
composition. They compared the performance of a neural
network that was trained on a balanced sample and the
performance of another neural network trained on more
representative samples. The weighted efficiency measure was
the highest for the former network and decreased when the
networks were trained using samples representative of the
3.4 Credit Risk Estimation
The task of credit risk analysis becomes more demanding due
to the increased number of bankruptcies and the competitive
offers of creditors. DM techniques have been applied to
facilitate the estimation of credit risk.
Huang et al. (2003) performed credit rating analysis by
using Support Vector Machines (SVMs), a machine learning
technique. Two data sets were used; one containing 74 Korean
firms and the other containing 265 US firms. For both data
sets 5 rating categories were defined. Two models for Korean
data set and two models for US data set, each one having a
different input vector were built.
SVMs and a backpropagation NN were used to predict credit
ratings. SVMs performed better in three of the four models.
Another consideration of the study was to interpret the NN.
The Garson method was used to measure the relative
importance of the input values.
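Garson's method derives the relative importance of each input from the absolute values of the connection weights of a trained network. A minimal sketch follows, assuming a network with one hidden layer and a single output neuron, bias terms ignored and no hidden neuron with all-zero incoming weights; the weight values used to exercise it are made up.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a single-output network.

    input_hidden[i][j]: weight from input i to hidden neuron j.
    hidden_output[j]:   weight from hidden neuron j to the output neuron.
    """
    n_inputs, n_hidden = len(input_hidden), len(hidden_output)
    contribution = [0.0] * n_inputs
    for j in range(n_hidden):
        # Total absolute weight arriving at hidden neuron j.
        column_total = sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        for i in range(n_inputs):
            # Share of hidden neuron j attributable to input i, scaled by the
            # strength of that neuron's connection to the output.
            share = abs(input_hidden[i][j]) / column_total
            contribution[i] += share * abs(hidden_output[j])
    total = sum(contribution)
    return [c / total for c in contribution]
```

For example, `garson_importance([[1.0, 1.0], [3.0, 1.0]], [2.0, -2.0])` attributes 37.5% of the importance to the first input and 62.5% to the second. Because only absolute values are used, the method ranks inputs by strength of influence but does not show its direction.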
Mues et al. (2004) used decision diagrams to visualize
credit risk evaluation rules. Decision diagrams have the
theoretical advantage over decision trees that they avoid the
repetition of isomorphic sub-trees. Three data sets, one
containing German data and two containing Benelux data,
were used. An NN was employed to perform the classification.
The rule extraction methods Neurorule and Trepan were
applied to extract rules from the network.
3.5 Corporate Performance Prediction
Lam (2003) developed a model to predict the rate of return on
common shareholders' equity. She used backpropagation NNs
and inferred rules from the weights of the connections by
applying the GLARE algorithm. The input vector included 15
financial statement ratios and 1 technical analysis variable. In
an additional experiment 11 macroeconomic variables were
also included. The data sample contained 364 firms.
Back et al. (2001) developed two models to cluster
companies according to their performance. Both models used
SOMs. The first model operated over financial data of 160
companies. By employing text mining techniques, the second
model analyzed the CEOs’ annual reports of the companies.
The authors concluded that there are differences between the
clustering results of the two methods.
Two models were developed, one analyzing financial ratios
and the other analyzing the CEOs’ reports. In this study a
different method, the Prototype-Matching Text Clustering,
was used for analyzing the reports. By comparing the results
of the qualitative and the quantitative methods the authors
concluded that the text reports tend to foresee changes in the
financial state before these changes explicitly influence the
financial ratios.
DM methods have classification and prediction abilities
which can support the decision-making process in financial
problems. The financial forecasting tasks in the
collected literature address the topics of bankruptcy
prediction, credit risk estimation, going concern reporting,
financial distress, corporate performance prediction and
management fraud. Bankruptcy prediction seems to be the
most popular application area.
The data mining methods employed in the collected
literature include Genetic Algorithms, NNs, Rough Set
Theory, Decision Trees, and Mathematical Programming.
Most researchers seem to prefer the Neural Network approach.
Although a substantial amount of research has
addressed the application of DM techniques in finance, there are
many fertile areas for further research.
The introduction of hybrid models, the improvement of
existing models, the extraction of comprehensible rules from
Neural Networks, the improvement of performance and the
integration of ERP systems with DM tools are some possible
future research directions.
In terms of the data used, the enrichment of the input
vector with qualitative information and the use and
evaluation of formal methods for feature selection and data
discretization are open research possibilities.
The future is open. Further research effort will improve
models and methods making DM an even more valuable tool
in finance and accounting.
