International Journal of Engineering Trends and Technology (IJETT) – Volume 16, Number 8 – Oct 2014

Data Mining in Finance

Sahil Kadam#1, Manan Raval#2
# Undergraduate Students, Electronics and Telecommunications Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India.

ABSTRACT: This paper is a survey of the growing field of data mining and its applications in finance. Data mining techniques provide great aid in financial accounting and fraud detection because of their classification and prediction abilities. The aim of this paper is to present the various data mining techniques used for Going Concern and Financial Distress, Fraud Management, Bankruptcy Prediction, Credit Risk Estimation and Corporate Performance Prediction. The paper provides the relevant background knowledge, presents the various data mining techniques and their implementation, and applies them to current problems in finance, including the building and evaluation of trading models, fraud management, bankruptcy prediction and the management of risk.

Keywords - Data mining, finance, risk management and fraud detection.

1. INTRODUCTION

The term "data mining" refers to new methods for the intelligent analysis of large collections of data. These methods have emerged from several historically separate fields, such as information systems, machine learning, artificial intelligence, data engineering and knowledge discovery. One of the most attractive application areas for these emerging technologies is finance, which is becoming increasingly amenable to data-driven modelling as large sets of financial data become available.

Data mining, also known as knowledge discovery in data, is the process of studying and analysing data from different sources and combining it into useful, more meaningful information that can be used to maximise income and profits, to cut costs, or both. In recent years, data mining has become an extremely important component in the operations of businesses and governments. Finance, on the other hand, is a broad term that describes two related activities: the study of how money is managed and the actual process of acquiring needed funds. Businesses, individuals and government entities all need funding to operate; hence, the field is often separated into three sub-categories: personal finance, corporate finance and public finance.

Forecasting the stock market, bank insolvencies and currency exchange rates, managing and understanding financial risk, bank customer profiling, futures trading, loan management, credit assessment, and money laundering analyses are the main financial tasks for data mining. The naive approach to data mining in finance assumes that somebody can provide a cookbook instruction on "how to achieve the best result". Certain academic investigators continue to encourage this unjustified belief. In fact, the only realistic approach proven to be successful is to compare different methods, showing their strengths and weaknesses relative to the characteristics of the problem, and to leave to the user the selection of the method that best fits the specific problem circumstances. Our work tries to illustrate the techniques used and to indicate the areas in which particular techniques are applied.
2. DATA MINING TECHNIQUES

The methods of data mining include a large number of algorithms, techniques and models that have emerged from the gradual convergence of databases, machine learning, statistics and data visualization. Most of these methods are used to study complex financial data. The data mining methods studied in this paper are Genetic Algorithms, Neural Networks, Rough Set Theory, Decision Trees, Particle Swarm Optimization, Mathematical Programming and Case Based Reasoning.

2.1 Neural Networks in Data Mining

In practical terms, neural networks are non-linear statistical modelling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data. Using neural networks as a tool, data warehousing firms harvest information from datasets in the process known as data mining. The difference between these data warehouses and ordinary databases is that the data are actually manipulated and cross-referenced, helping users make more informed decisions.

The neurons are organized into layers. A network layered in this manner consists of at least an input (first) and an output (last) layer. Between these two layers there may exist one or more hidden layers. Different kinds of NNs have different numbers of layers. Self-organizing maps (SOMs) have only an input and an output layer, whereas a backpropagation NN additionally has one or more hidden layers [1].

After defining the network architecture, the network needs to be trained. In backpropagation networks a pattern is applied to the input layer and a final output is calculated at the output layer. This output is compared with the desired result and the errors are propagated backwards through the NN by tuning the weights of the connections. This process iterates until an acceptable error rate is reached. Backpropagation NNs have become popular for prediction and classification problems.

The SOM is a clustering and visualization method based on unsupervised learning. For each input vector, only one output neuron is activated. The winner's weight vector is updated to correspond with the input vector. Thus, similar inputs are mapped to the same or neighbouring output neurons, forming clusters. Two commonly used SOM topologies are the rectangular lattice, where each neuron has four neighbours, and the hexagonal lattice, where each neuron has six neighbours.

NNs seem to attract the interest of most researchers in the area of our concern. Their structure and working principles enable them to deal with problems for which an effective algorithmic solution is not available. Since they learn from examples and generalize to new observations, they can classify previously unseen patterns. They have the ability to deal with incomplete, ambiguous and noisy data. Unlike traditional statistical techniques, they make no a priori assumptions about the data distribution properties, nor do they assume independent input variables.

Thus, artificial neural networks (ANNs) are a non-linear statistical model based on the working of the human brain. They are powerful tools for modelling unknown relationships in data. ANNs are able to identify the complex patterns between input and output variables and then forecast the outcome for new, independent input data.
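As an illustration of the backpropagation networks described in Section 2.1, the following sketch trains a small multilayer perceptron to classify firms from a handful of financial ratios. It is not taken from any of the surveyed studies; it assumes scikit-learn is available, and the features are randomly generated placeholders for real financial ratios.

# Hypothetical example: a small backpropagation network for a binary
# classification task (e.g. going concern vs. non going concern firms).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for six financial ratios per firm, 330 firms
# (e.g. 165 going concern and 165 matched non going concern, as in Koh 2004).
X, y = make_classification(n_samples=330, n_features=6, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# One hidden layer; the connection weights are tuned by backpropagation
# until the error stops improving or the iteration limit is reached.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("holdout accuracy:", net.score(X_test, y_test))

In the surveyed studies the input vector would contain actual financial ratios rather than synthetic features, and the output would encode the state of interest (for example, going concern status).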
2.2 Genetic Algorithms

The Genetic Algorithm (GA) was developed by Holland in the 1970s. It combines Darwinian evolutionary theory with sexual reproduction. A GA is a stochastic search algorithm modelled on the process of natural selection that underlies biological evolution. GAs have been successfully applied to many optimization, search, and machine learning problems.

A GA proceeds iteratively by producing new populations of strings from old ones. Each string is the encoded (binary, real, etc.) version of a candidate solution. An evaluation function assigns a fitness measure to every string, indicating its suitability for the problem. A standard GA applies genetic operators such as selection, crossover and mutation to an initially random population in order to compute a whole generation of new strings. Three operators are applied to chromosomes:

1) Reproduction, where individuals multiply by duplicating themselves with a probability proportional to their fitness value.

2) Crossover, where two chromosomes reciprocally exchange some bits, creating new chromosomes.

3) Mutation, which works on a single chromosome by altering one or more bits. The probability of mutation is very low.
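A toy illustration of these three operators is sketched below. It is not drawn from the surveyed studies; the bit-string encoding, the fitness function (simply the number of 1-bits) and the parameter values are arbitrary assumptions chosen only to keep the example short.

import random

def fitness(chrom):                  # toy fitness: count of 1-bits in the string
    return sum(chrom)

def evolve(pop_size=20, length=16, generations=40, p_mut=0.01):
    # Initially random population of binary strings.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Reproduction: fitter strings are copied more often.
        weights = [fitness(c) + 1 for c in pop]
        parents = random.choices(pop, weights=weights, k=pop_size)
        nxt = []
        # Crossover: pairs of chromosomes exchange their tails at a random cut point.
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, length)
            nxt += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Mutation: flip individual bits with a very low probability.
        for chrom in nxt:
            for i in range(length):
                if random.random() < p_mut:
                    chrom[i] ^= 1
        pop = nxt
    return max(pop, key=fitness)

print(evolve())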
2.3 Decision Trees

The decision tree is a classification and prediction method which sequentially partitions observations into mutually exclusive subgroups. The method searches for the attribute that best splits the samples into separate classes. Subgroups are split further until they become too small or until no significant statistical difference remains between candidate subsets. If the decision tree becomes too large, it is finally pruned.

2.4 Particle Swarm Optimization

Research on biological colonies has revealed that the intelligence generated by their complex collective behaviour can deliver efficient solutions to specific optimization problems. Inspired by the social behaviour of animals such as fish schooling and bird flocking, Kennedy and Eberhart designed Particle Swarm Optimization (PSO) in 1995. The basic PSO model involves a group of particles moving in a d-dimensional search space. The direction and distance travelled by every particle in this space are determined by its fitness and velocity. Generally, the fitness is principally connected with the optimization objective, and the velocity is updated according to a learning rule.

2.5 Rough Set Theory

Rough Set Theory (RST) was introduced by Pawlak (1982). RST extends set theory with the notion of an element's possible membership in a set. Given a class C, the lower approximation of C consists of the samples that certainly belong to C. The upper approximation of C consists of the samples that cannot be described as not belonging to C. RST may be used to describe dependencies between attributes, to evaluate the significance of attributes, to deal with inconsistent data and to handle uncertainty (Dimitras et al. 1999) [1].

3. APPLICATIONS

To date, data mining has become a promising solution for identifying dynamic and nonlinear relations in financial data. It has been applied to diverse financial areas including stock forecasting, portfolio management and investment risk analysis, prediction of bankruptcy and foreign exchange rates, detection of financial fraud, loan payment prediction, customer credit policy analysis, and so on. In this paper we focus primarily on the first five applications in the above list, which have been discussed most in the literature.

3.1 Going Concern and Financial Distress

According to SAS 59, the auditor has to evaluate the ability of his or her client to continue as a going concern for at least one year past the balance sheet date. If there are signs that the client corporation will face financial problems which may lead to failure, the auditor has to issue a going concern report. Assessing going concern status is not an easy task: studies indicate that only a relatively small share of failed companies had been qualified on a going concern basis (Koh 2004). To support auditors in the going concern reporting task, statistical and machine learning methods have been proposed.

Koh (2004) compared backpropagation NNs, decision trees and logistic regression in a going concern prediction study. The data sample contained 165 going concern companies and 165 matched non going concern companies, and six selected financial ratios were used as input variables. The author reported that decision trees outperformed the other two methods.

Tan and Dihardjo (2001) built upon a previous study by Tan, which tried to predict financial distress for Australian credit unions using NNs. In his earlier study Tan used quarterly financial data and tried to predict distress on a quarterly basis. Tan and Dihardjo improved the method by introducing the notion of an "early detector": when the model predicts that a credit union will become distressed in a particular quarter and the union actually becomes distressed in a later quarter, within a maximum of four quarters, the quarter is labelled as an early detection. The improved method performed better than the previous one in terms of the Type II error rate. Thirteen financial ratios were used as input variables and a sample of 2144 observations was used. The results were compared with those of a probit model and were found to be marginally better, especially for the Type I error rate.

Konno and Kobayashi (2000) proposed a method for enterprise rating using mathematical programming techniques. The method makes no distributional assumptions about the data. Three alternatives, based on discrimination by a hyperplane, by a quadratic surface and by an elliptic surface, were employed. Six financial ratios derived from financial statements were used as input variables and the data sample contained 455 enterprises. The method calculates a score for each enterprise.

3.2 Fraud Management

Management fraud is the deliberate fraud committed by managers through falsified financial statements. Management fraud harms tax authorities, shareholders and creditors. Spathis (2002) developed two models for identifying falsified financial statements (FFS) from publicly available data. The input variables for the first model comprise nine financial ratios. For the second model the z-score is added as an input variable to capture the relationship between financial distress and financial statement manipulation. The method used is logistic regression, and the data sample contained 38 FFS and 38 non-FFS firms. For both models, the results show that three variables with significant coefficients entered the model.
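The Spathis (2002) models are logistic regressions over published financial ratios. The sketch below only shows the general shape of such a model; it is not Spathis's actual specification, scikit-learn is assumed, and the ratio matrix and FFS labels are synthetic placeholders.

# Hypothetical logistic regression for falsified financial statement (FFS)
# detection, in the spirit of the study described above but with made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 76                                   # e.g. 38 FFS + 38 non-FFS firms
X = rng.normal(size=(n, 9))              # stand-in for nine financial ratios
y = np.r_[np.ones(38), np.zeros(38)]     # 1 = FFS, 0 = non-FFS

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_.round(2))            # sign and size of each ratio's effect
print("P(FFS) for first firm:", model.predict_proba(X[:1])[0, 1])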
3.3 Bankruptcy Prediction

Predicting bankruptcy is of great benefit to anyone who has a relationship with the firm concerned, since bankruptcy is the final state of corporate failure. In the 21st century, corporate bankruptcy around the world has reached an unprecedented level. It results in huge economic losses to companies, stockholders, employees and customers, together with tremendous social and economic costs to the nation. Accurate prediction of bankruptcy has therefore become an important issue in finance. Companies also demand explanations for the logic behind a prediction: they find it more acceptable to hear, for instance, that the prediction is based on computer-generated rules than that the decision was made by an advanced technique that offers no explanation.

The breakthrough bankruptcy prediction model was the Z-score model developed by Altman. The five-variable Z-score model, built using multiple discriminant analysis, showed very strong predictive power. Since then, discriminant analysis has proved to be the most widely accepted and successful method in the bankruptcy prediction literature. In addition, numerous studies have tried to develop bankruptcy prediction models using other data mining techniques, including logistic regression analysis, genetic algorithms, decision trees, classification and regression trees (CART), and other statistical methods. These techniques can generally provide good interpretability of the prediction models. In the past two decades, a number of studies have also applied the neural network approach to bankruptcy prediction, most of them centring on comparing the predictive performance of neural networks with other methodologies such as discriminant analysis and logit analysis. Some have reported that the performance of neural networks is slightly better than that of other techniques, but the results are contradictory or inconclusive.

Although neural networks and statistical models have been used for bankruptcy prediction, they may encounter the problem of unequal frequencies of the two states of interest, which creates at least two major obstacles when evaluating predictive performance. The first issue concerns the impact of unequal frequencies of the two states (e.g., bankrupt versus non-bankrupt) on training a neural network or estimating the parameters of statistical models. Drawing random samples from unbalanced populations will likely yield samples that contain an overwhelming majority of one state of interest; consequently, the decision performance of neural networks or statistical models may be poor when they are tested in realistic situations. To overcome this problem, researchers have adopted choice-based sampling, in which the probability of an observation entering the sample depends on the value of the dependent variable. The second problem involves evaluating the accuracy of the various decision models: the percentage of observations correctly classified can be very misleading with unbalanced samples.

In general, training a neural network with balanced samples in applications such as bankruptcy prediction can enable the network to familiarize itself with the infrequent state of interest, and it has been argued that networks trained on such samples provide the best results when tested under realistic conditions. Jain and Nag constructed several training samples with different compositions and compared the performance of a neural network trained on a balanced sample with that of networks trained on more representative samples. The weighted efficiency measure was highest for the former network and decreased when the networks were trained on samples representative of the population.
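For reference, the five-variable Altman model mentioned above is usually quoted with the coefficients of the original 1968 study, as in the sketch below; the sketch simply evaluates the score for one hypothetical firm, and the ratio values are invented.

# Altman (1968) Z-score for public manufacturing firms:
# Z = 1.2*X1 + 1.4*X2 + 3.3*X3 + 0.6*X4 + 1.0*X5
def z_score(wc_ta, re_ta, ebit_ta, mve_tl, sales_ta):
    """X1, X2, X3, X5 are working capital, retained earnings, EBIT and sales,
    each divided by total assets; X4 is market value of equity over book
    value of total liabilities."""
    return 1.2*wc_ta + 1.4*re_ta + 3.3*ebit_ta + 0.6*mve_tl + 1.0*sales_ta

# Invented ratios for a hypothetical firm.
z = z_score(wc_ta=0.10, re_ta=0.15, ebit_ta=0.08, mve_tl=0.90, sales_ta=1.20)
print(z)  # commonly cited cut-offs: Z < 1.81 distress zone, Z > 2.99 safe zone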
3.4 Credit Risk Estimation

The task of credit risk analysis has become more demanding because of the increased number of bankruptcies and the competitive offers of creditors. DM techniques have been applied to facilitate the estimation of credit risk.

Huang et al. (2003) performed credit rating analysis using Support Vector Machines (SVMs), a machine learning technique. Two data sets were used, one containing 74 Korean firms and the other 265 US firms; for both data sets, five rating categories were defined. Two models for the Korean data set and two models for the US data set, each with a different input vector, were built. SVMs and backpropagation NNs were used to predict the credit rating, and the SVMs performed better in three of the four models. A further aim of the study was to interpret the NN: the Garson method was used to measure the relative importance of the input variables.

Mues et al. (2004) used decision diagrams to visualize credit risk evaluation rules. Decision diagrams have the theoretical advantage over decision trees that they avoid the repetition of isomorphic sub-trees. One data set containing German data and two containing Benelux data were used. A NN was employed to perform the classification, and the rule extraction methods Neurorule and Trepan were applied to extract rules from the network.
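A bare-bones version of an SVM credit rating classifier of the kind used by Huang et al. might look like the sketch below; it is only illustrative, assumes scikit-learn, and uses randomly generated features and the RBF kernel as stand-ins for the actual financial data and model settings of that study.

# Hypothetical multi-class SVM for credit rating with five rating categories.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=265, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # an RBF kernel is a common default choice
clf.fit(X_tr, y_tr)
print("rating accuracy on holdout:", clf.score(X_te, y_te))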
3.5 Corporate Performance Prediction

Lam (2003) developed a model to predict the rate of return on common shareholders' equity. She used backpropagation NNs and inferred rules from the weights of the connections by applying the GLARE algorithm. The input vector included 15 financial statement ratios and 1 technical analysis variable; in an additional experiment 11 macroeconomic variables were also included. The data sample contained 364 firms.

Back et al. (2001) developed two models to cluster companies according to their performance. Both models used SOMs. The first model operated on financial data from 160 companies, while the second model employed text mining techniques to analyse the CEOs' annual reports of the companies. The authors concluded that there are differences between the clustering results of the two methods. In a related study, two models were again developed, one analysing financial ratios and the other analysing the CEOs' reports; in that study a different method, Prototype-Matching Text Clustering, was used to analyse the reports. By comparing the results of the qualitative and the quantitative methods, the authors concluded that the text reports tend to foresee changes in the financial state before these changes explicitly influence the financial ratios.

4. CONCLUSION

DM methods have classification and prediction abilities which can support the decision making process in financial problems. The financial forecasting tasks in the collected literature address the topics of bankruptcy prediction, credit risk estimation, going concern reporting, financial distress, corporate performance prediction and management fraud. Bankruptcy prediction appears to be the most popular application area.

The data mining methods employed in the collected literature include Genetic Algorithms, NNs, Rough Set Theory, Decision Trees and Mathematical Programming. Most researchers seem to prefer the neural network model. Although a substantial quantity of research work has addressed the application of DM techniques in finance, there are many fertile areas for further research. The introduction of hybrid models, the improvement of existing models, the extraction of comprehensible rules from neural networks, the improvement of performance and the integration of ERP systems with DM tools are some possible future research directions. In terms of the data used, the enrichment of the input vector with qualitative information and the use and evaluation of formal methods for feature selection and data discretization are open research possibilities. The future is open: further research effort will improve models and methods, making DM an even more valuable tool in finance and accounting.

5. REFERENCES

[1] Efstathios Kirkos and Yannis Manolopoulos, "Data Mining in Finance and Accounting: A Review of Current Research Trends".
[2] Abhijit A. Sawant and P. M. Chawan, "Study of Data Mining Techniques used for Financial Data Analysis", International Journal of Engineering Science and Innovative Technology (IJESIT), Volume 2, Issue 3, May 2013, ISSN: 2319-5967.
[3] Boris Kovalerchuk and Evgenii Vityaev, "Data Mining for Financial Applications".
[4] Gary D. Boetticher, "Teaching Financial Data Mining using Stocks and Futures Contracts".
[5] Mark K.Y. Mak, George T.S. Ho and S.L. Ting, "A Financial Data Mining Model for Extracting Customer Behavior", accepted 23 July 2011.