The Application of Panel Data Mining Based on Gene Schema in Predicting Finance Distress

Feng-ning Ma1, Ji-ting Yang2, Shi-qiang Jiang3, Qin-yu Ren3
1College of Management and Economics, Tianjin University, Tianjin, China
2College of Management and Economics, Tianjin University, Tianjin, China
3College of Management and Economics, Tianjin University, Tianjin, China
(sophia-yangji@163.com)

Abstract - Various methods have been applied to the prediction of financial distress, including statistical analysis, neural networks, genetic algorithms and logistic analysis. Although these classical methods perform well, they still have some disadvantages. Financial data are naturally panel data, yet most investigations fit the underlying statistical model to a single year’s financial data, and may therefore fail to characterize the tendency toward business failure of ST (special treatment) companies. In comparison, panel data combine cross-sectional data with time-series data, and so provide researchers with a large amount of data as well as multi-dimensional perspectives. Using panel data expressed as binary genes, this article aims to construct a dynamic prediction model that exploits multiple years of financial data. With a dynamic thresholding technique, the marginal value used in discretization is derived by floating relative to the corresponding industry average. From the resulting discrete expression, the period gene is identified in the binary time sequence and is then used to recognize ST companies. Numerical experiments show that the new method can improve prediction accuracy on real financial data, which is of significance to both theoretical analysis and practical applications.

Keywords - Panel data, period gene schema, financial distress

I. INTRODUCTION

Over the last four decades, financial distress prediction has been extensively investigated, evolving from elementary statistical methods to more appealing intelligent techniques. Predicting corporate failure has long remained an important issue, since it affects the decisions of interested parties including stockholders, creditors, senior management and auditors. The pioneering work on the prediction of financial failure was carried out by William Beaver [1] in 1966, using a simple univariate analysis. In 1968, Edward Altman [2] employed a multivariate linear discriminant analysis to analyze the corporate failure of several companies. Later, in 1980, Ohlson [3] used the logit model to identify 9 significant variables that have a heavy impact on firm failure. In 1992, Tam and Kiang [4] applied back-propagation neural networks (BPNN) to failure prediction and concluded that BPNN performed better than the other methods. Since then, an important trend has emerged of building sophisticated soft computing architectures or hybrid intelligent strategies for the problem [5]. Jie Sun [6] proposed a model combining attribute-oriented induction, information gain and decision trees for financial distress prediction. Myoung-Jong Kim [7] put forward a neural network ensemble for bankruptcy prediction, which was shown to be more accurate. Philippe du Jardin [8] improved the prediction accuracy of a neural-network-based model using a set of variables selected with a criterion. Lili Sun [9] built naive Bayes Bayesian network models for bankruptcy prediction using operational guidance. In short, plenty of techniques have been introduced in recent years to address the problem of financial prediction.
Unfortunately, most of these investigations rely on a static model rather than a more efficient dynamic one. They focus on short-range financial data, e.g. using the data of year t-1 to predict the state in year t, and essentially ignore the long-term history. In this work we treat financial prediction as a dynamic statistical model based on panel data, which can characterize the tendency toward business failure of financially distressed companies. Panel data combine time-series data and cross-sectional data into two-dimensional data over time and space. Since panel data contain many more records, they offer more degrees of freedom than cross-sectional data.

Generally speaking, there are two approaches to data mining: one establishes a continuous regression function, as in the logit model and neural networks; the other divides the samples into classes, as in clustering algorithms. Both theory and experiments have shown that the latter performs excellently in certain cases, especially with high-dimensional data. Our research is devoted to developing a more efficient, completely discrete model, in which each real instance is treated as a discrete module and the financial prediction is then conducted more efficiently.

II. OUR METHOD

In 1993, Tichy and Sherman [10] first proposed the concept of “corporate DNA”. This concept holds that each corporation, like a human being, has its own unique genes, and that this innate property determines its fundamental stable pattern, development tendency and variation. In this work, the financial information of a company is regarded as a special corporate gene.
Based on this new perspective, we can identify the common gene corresponding to the critical schema of financial distress, which stands in sharp contrast to that of well-run companies, and hence establish an efficient financial prediction model. In this paper, a single year’s financial information is converted into binary variables according to a well-designed conversion criterion, and the result is regarded as the individual gene. We put forward a novel concept, the “period gene schema”, which contains the individual genes of several consecutive years and expresses a company’s gene schema as a time series.

A. The principle of the schema

As the most popular coding scheme in genetic algorithms, binary coding uses the binary set {0, 1} as its alphabet. That is, the gene expression of each individual in the population can be viewed as a binary string. Furthermore, we may add a redundant element "*", referred to as a wildcard, which can stand for either “0” or “1”. The binary alphabet is thereby generalized to the ternary set {0, 1, *}, over which strings such as 0110, **0110 and 1110*01** can be generated. A string over the ternary set {0, 1, *}, which describes a set of binary strings with a similar structure, is referred to as a pattern. For example, the pattern *1* describes the 4-element subset {010, 110, 111, 011}. Accordingly, for binary coding strings of length L with wildcard notation, there are 3^L patterns in total. Based on this binary coding with wildcards, we construct the binary expression from the real continuous data according to a specific criterion, and then combine the expressions along the time dimension.

B. Mapping rule

The first question is how to obtain the discrete gene expression from the given continuous financial data.
With regard to unsupervised discretization, the widely adopted methods include equal-width and equal-frequency discretization. Specifically, the equal-width method divides the range into a prescribed number N of boxes of equal width. If the original continuous range is denoted by [a, b], the equal-width sub-intervals are [a, a+(b-a)/N], [a+(b-a)/N, a+2(b-a)/N], …, [b-(b-a)/N, b].

In this research, an improved equal-width discretization is presented. Instead of dividing the range directly into boxes of fixed width, a marginal value is adopted during discretization to delimit the sub-intervals. This marginal value, denoted by A, is derived from the industry average of the financial data by increasing or decreasing it. The extent of the rise or fall is decided by the repeated experiments reported below. The two states are then determined by comparison with A: the resulting state is “0” when K_ij is smaller than A, and “1” when K_ij reaches or exceeds A. Thus, we have:

R=0 while K_ij ∈ (-∞, A). (1)
R=1 while K_ij ∈ [A, +∞). (2)

C. Definition of binary period gene schema

t_s: the sth time, s=1…P
K_j: the jth finance index, j=1…n
K_ij: the jth finance index of the ith company
R_ij: the gene schema of the jth finance index for the ith company, which is the mapping of K_ij, f: K_ij(t_s) → R_ij(t_s), with R_ij(t_s) ∈ {0,1}; the mapping rules are (1) and (2).
X_i = {R_i1(t_1)…R_i1(t_P), R_i2(t_1)…R_i2(t_P), …, R_in(t_1)…R_in(t_P)}, which is composed of the patterns R_ij of all 26 indexes over three successive years (i.e. 2004-2006).

In this investigation, P is empirically set to 3, which means that the financial data of the past three years are used. Taking the Kelon Electric Appliance Company Limited as an example, its 26 financial indicators of 2002-2004 are compared with the marginal value derived from the mean value of its industry.
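The mapping rule and the period gene defined above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names are hypothetical, the marginal value A is assumed to be the industry mean scaled multiplicatively by the chosen rise or fall (the paper does not give the exact formula), and the sample figures are invented for illustration.

```python
from typing import List


def marginal_value(industry_mean: float, shift: float) -> float:
    """Marginal value A: industry mean moved up (shift > 0) or down (shift < 0).

    Assumption: the shift is applied multiplicatively, e.g. shift=0.15 for "up 15%".
    """
    return industry_mean * (1.0 + shift)


def to_gene(k: float, a: float) -> str:
    """Mapping rule: '0' when the index value is below A, otherwise '1'."""
    return "0" if k < a else "1"


def period_gene(values: List[float], industry_means: List[float], shift: float) -> str:
    """Binary string of one index over P consecutive years (one bit per year)."""
    return "".join(
        to_gene(k, marginal_value(m, shift)) for k, m in zip(values, industry_means)
    )


def matches(schema: str, gene: str) -> bool:
    """True when a binary gene fits a pattern over {0, 1, *}; '*' fits either bit."""
    return len(schema) == len(gene) and all(s in ("*", g) for s, g in zip(schema, gene))


# Invented three-year values of one index for one company, with industry means.
current_ratio = [0.9, 1.4, 1.0]
industry_mean = [1.0, 1.1, 1.2]

gene = period_gene(current_ratio, industry_mean, shift=0.15)
print(gene, matches("*1*", gene), matches("000", gene))
```

Keeping the gene as a plain string makes the comparison with a wildcard pattern a character-by-character check, which mirrors the pattern notation of Section II-A.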
When a financial indicator is smaller than the corresponding marginal value, we have R=0; otherwise, we set R=1. As a consequence, the binary expression of Kelon’s company gene over the three years considered can be derived, which can be regarded as the “period gene schema” of Kelon Electric Appliance.

D. Building of prediction model

Based on statistical techniques, a prediction pattern is extracted that can efficiently distinguish potentially distressed companies from healthy ones. Since three years of financial data are analyzed in this paper, each index has eight kinds of gene schema (000, 001, 010, 011, 100, 101, 110, 111). We calculate the percentage of each gene schema of each index in the ST sample and in the Non-ST sample respectively, which is regarded as the response ratio of that index’s gene schema. The new prediction model is built on schemas whose response ratio is generally high in distressed companies and low in healthy companies. Under different marginal values, different prediction models are obtained. The model with the highest accuracy is taken as the best prediction model.

III. RESULTS

A. The selection of sample

Since the promulgation of the company bankruptcy law in 1986, very few listed companies have gone bankrupt, so it is relatively difficult to construct adequate analysis samples. Instead, the special treatment (ST) companies, which have been warned by the China Securities Regulatory Commission, are widely adopted in most existing domestic investigations. Hence, a similar strategy is used in our analysis: the ST companies are regarded as distressed, while those without any special treatment (Non-ST) are regarded as healthy.
Besides, in order to eliminate the effects of different industries, the industry mean value serves as the critical value. Ratio indexes are adopted to minimize the impact of company size. After removing the companies with missing or anomalous data, a total of 460 companies are selected as the ST and Non-ST instances, with a main focus on 2006-2008. The whole sample is divided into two subsamples, the test set and the prediction set. The former contains 230 companies, 115 ST and 115 Non-ST, and the latter likewise consists of 230 companies, 115 ST and 115 Non-ST. Considering that the data of an ST company may fluctuate noticeably, the financial data of the years just before the first special treatment are used. For example, for a company first given ST status in 2006, we construct the financial schema from the earlier data of 2002 to 2004.

B. The choice of financial index

In the classical literature, 3 financial indexes were highlighted by Beaver in [1]. In 1968, Altman employed 5 indexes in the so-called Z-score model, and in 1977 he extended the number of financial indexes to 7 in his improved model. Ohlson employed 9 significant variables in the logit model. By combining these well-known indexes, which have been adopted by most other investigations, we use 26 financial indexes that reflect the main properties of a company: the ability of short repaying, long repaying, cash flowing, earning, development and operating. Table I lists the indexes.

TABLE I. TABLE OF INDEXES

property: ability of short repaying, ability of long repaying, cash flowing, earning, development, operating

index: current ratio, quick ratio, the ratio of net working capital and total assets, debt ratio, the ratio of long-term liabilities and net working capital, the debt-equity ratio, the stockholders' equity ratio, long-term debt to total asset ratio, the ratio of cash flow and liabilities, cash ratio, net profit on sales, net profit on total assets, the ratio of net assets and net profit, the ratio of operating profits and costs and expenses, the growth rate of fixed assets, the growth rate of total assets, the growth rate of net profit, operating ratio, the ratio of management expenses and main business income, financial ratio, fixed asset ratio, inventory turnover, fixed asset turnover, current assets turnover, the assets turnover, the stockholders' equity turnover

C. Prediction of model

TABLE II. TABLE OF SCHEMA RATIO

index: current ratio
schema   Ratio of ST   Ratio of Non-ST   Difference between ST and Non-ST
000      0.1724        0.0259            0.1465
001      0.0172        0.0172            0
010      0.0086        0                 0.0086
011      0.0517        0.0345            0.0172
100      0.0431        0.0345            0.0086
101      0.0172        0.0259            -0.0087
110      0.0172        0.069             -0.0518
111      0.8103        0.6379            0.1724

index: financial ratio
schema   Ratio of ST   Ratio of Non-ST   Difference between ST and Non-ST
000      0.7586        0.5259            0.2327
001      0.0431        0.0431            0
010      0.0086        0.0259            -0.0173
011      0             0.0345            -0.0345
100      0.0603        0.0948            -0.0345
101      0.0086        0.0172            -0.0086
110      0.0603        0.0431            0.0172
111      0.0517        0.2069            -0.1552

index: the ratio of net working capital and total assets
schema   Ratio of ST   Ratio of Non-ST   Difference between ST and Non-ST
000      0.431         0.2414            0.1896
001      0.0431        0.0172            0.0259
010      0.0086        0.0172            -0.0086
011      0.0086        0.0259            -0.0173
100      0.0603        0.0345            0.0258
101      0.0259        0.0172            0.0087
110      0.1034        0.0517            0.0517
111      0.3103        0.5862            -0.2759

Table II illustrates the response ratios of ST and Non-ST companies under each financial index shown when the marginal value is 15% above the average. “Ratio of ST” is the percentage of each schema under each index in the ST sample; likewise, “Ratio of Non-ST” is the percentage of each schema under each index in the Non-ST sample.
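The response ratios and their differences can be computed along the following lines. This is a minimal Python sketch, not the authors' implementation: the function names and the toy genes are hypothetical, and the decision rule in `accuracy` (flag a company as ST exactly when its period gene equals the selected schema) is an assumption, since the paper does not spell out the rule.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

# The eight three-year schemas, 000 through 111.
SCHEMAS = ["".join(bits) for bits in product("01", repeat=3)]


def response_ratio(genes: List[str]) -> Dict[str, float]:
    """Share of companies in one sample that show each schema for one index."""
    counts = Counter(genes)
    return {s: counts[s] / len(genes) for s in SCHEMAS}


def ratio_difference(st: List[str], non_st: List[str]) -> Dict[str, float]:
    """Ratio of ST minus ratio of Non-ST, per schema."""
    r_st, r_non = response_ratio(st), response_ratio(non_st)
    return {s: r_st[s] - r_non[s] for s in SCHEMAS}


def accuracy(st: List[str], non_st: List[str], schema: str) -> float:
    """Assumed rule: a company is flagged as ST exactly when its gene equals `schema`."""
    hits = sum(g == schema for g in st) + sum(g != schema for g in non_st)
    return hits / (len(st) + len(non_st))


# Toy period genes of one index (hypothetical, not the paper's data).
st_genes = ["000", "000", "000", "111"]
non_st_genes = ["111", "111", "000", "011"]

diff = ratio_difference(st_genes, non_st_genes)
best = max(diff, key=diff.get)  # schema with the largest ST minus Non-ST gap
print(best, diff[best], accuracy(st_genes, non_st_genes, best))
```

Repeating this for every index and every candidate marginal value, and keeping the schema with the best accuracy, follows the model selection procedure of Section II-D.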
“Difference between ST and Non-ST” is the difference between the Ratio of ST and the Ratio of Non-ST for the same schema under the same index. Take “current ratio” for example: the response ratio of schema “000” is 17.24% for ST companies, compared with 2.59% for Non-ST companies. Among the eight schemas of the index “current ratio”, the schema “111” has the greatest difference between the ST and Non-ST samples, so we conclude that “111” is the best schema of “current ratio” under a marginal value 15% above the mean.

Based on repeated empirical experiments, we choose the marginal value 15% above the mean value, and choose the schemas whose ST ratio is 15% larger than the Non-ST ratio. Table III reports the prediction accuracy under different marginal values, obtained by shifting the mean value of each industry to various extents. The resulting significant schemas include the 111 pattern of financial expenses ratio, the 000 pattern of current ratio and the 111 pattern of financial ratio, with which prediction accuracies of 86.09% and 80% are achieved in the experiments. The combination of the 000 pattern of current ratio and the 111 pattern of financial ratio also predicts significantly, achieving 75.65%.

TABLE III. TABLE OF ACCURACY

Extent to the mean value   index                               schema    accuracy
0%                         current ratio                       000       0.7826
0%                         quick ratio                         000       0.7130
0%                         financial ratio                     111       0.8067
0%                         current ratio and quick ratio       000,000   0.6700
up 15%                     current ratio                       000       0.8609
up 15%                     financial ratio                     111       0.8000
up 15%                     current ratio and financial ratio   000,111   0.7565
down 15%                   current ratio                       000       0.7217
down 15%                   financial ratio                     111       0.8435
down 15%                   current ratio and financial ratio   000,111   0.6700
up 20%                     current ratio                       000       0.8522
up 20%                     financial ratio                     111       0.8087
up 20%                     current ratio and financial ratio   000,111   0.7304
down 20%                   current ratio                       000       0.6348
down 20%                   financial ratio                     111       0.8170

IV. CONCLUSION

Traditional methods for constructing a stable prediction model often rely on single-year data, which cannot capture the trend before ST. Following the schema theory of genetic algorithms, this paper presents a panel data mining method based on binary variables. The predictions of the new method are no worse than those of classic ones; meanwhile, its principle is simple, and it can help provide qualified information to interested parties. Some aspects could still be improved. This research uses a statistical method to establish the prediction model; other advanced techniques, such as genetic algorithms, could also be applied. Furthermore, schemas containing “*” could also be investigated in the prediction model.

REFERENCES

[1] W. H. Beaver, “Financial ratios as predictors of failure,” Journal of Accounting Research. Chicago, vol. 4 (supplement), pp. 71-111, 1966.
[2] E. I. Altman, “Financial Ratios, Discriminant Analysis and Prediction of Corporate Bankruptcy,” Journal of Finance. Pennsylvania, vol. 23, pp. 589-609, 1968.
[3] James A. Ohlson, “Financial Ratios and the Probabilistic Prediction of Bankruptcy,” Journal of Accounting Research. Chicago, vol. 18, pp. 109-131, 1980.
[4] K. Y. Tam, M. Y. Kiang, “Managerial applications of neural networks: The case of bank failure predictions,” Management Science. Pennsylvania, vol. 38, pp. 926-947, 1992.
[5] P. Ravi Kumar, V. Ravi, “Bankruptcy prediction in banks and firms via statistical and intelligent techniques – A review,” European Journal of Operational Research. Canterbury, vol. 180, pp. 1-28, 2007.
[6] Jie Sun, Hui Li, “Data mining method for listed companies’ financial distress prediction,” Knowledge-Based Systems, vol. 21, pp. 1-5, 2008.
[7] Myoung-Jong Kim, Dae-Ki Kang, “Ensemble with neural networks for bankruptcy prediction,” Expert Systems with Applications, vol. 37, pp. 3373-3379, 2010.
[8] Philippe du Jardin, “Predicting bankruptcy using neural networks and other classification methods: The influence of variable selection techniques on model accuracy,” Neurocomputing, vol. 73, pp. 2047-2060, 2010.
[9] Lili Sun, Prakash P. Shenoy, “Using Bayesian networks for bankruptcy prediction: Some methodological issues,” European Journal of Operational Research. Canterbury, vol. 180, pp. 738-753, 2007.
[10] N. M. Tichy, S. Sherman, “Control Your Destiny or Someone Else Will,” Doubleday/Currency, New York, NY, 1993.