Stat 503, Dr. Cook
Enhua Ma
Spring 1999

Credit Evaluation Course Project

1. Introduction

Credit evaluation is concerned with assigning a score to an existing or prospective loan based on the characteristics of the applicant. In general, financial ratios are widely used to evaluate a client's credit level. Eight key ratios, mainly related to risk analysis, are used in our study to evaluate a client's credit level. The available measures are:

Group: dummy variable (1: good credit, 2: bad credit)
X1: ratio of current assets to current liabilities (C. ASSETS / C. LIABILITIES)
X2: proportion of working capital in total assets (W. CAPITAL / T. ASSETS)
X3: ratio of current liabilities to total assets (C. LIABILITIES / T. ASSETS)
X4: proportion of current assets in total assets (C. ASSETS / T. ASSETS)
X5: ratio of total liabilities to total assets (T. LIABILITIES / T. ASSETS)
X6: ratio of net worth change to total assets (CH. NETWORTH / T. ASSETS)
X7: ratio of farm land value to total assets (FARMLAND VALUE / T. ASSETS)
X8: ratio of total liabilities to net worth (T. LIABILITIES / NETWORTH)

The primary questions are:
1. How do we distinguish the clients' financial credit according to combinations of their asset and liability information?
2. Can we do a good prediction job with the resulting classification rules?

2. Suggested Approaches

Data restructuring (variable transformations): some of the ratios show high skewness and kurtosis, and assessments of normality, variances, and outliers show that restructuring is needed.

Different colors and glyphs for different groups: to allow us to inspect the group difference visually.

Summary statistics: to obtain basic numerical information about the variables ("What are the basic structures of the variables?").

Visual inspection (under XGobi):
- Box plots: "Which variables might be useful discriminators of the client's credit?"
- Bivariate plots: "Which pairs of variables can do a better job in classifying the client's credit?"
- Rotation and grand tour: "What kinds of combinations of variables might be useful for the two-group classification?"

Numerical analysis (linear discriminant analysis, classification and regression trees, feed-forward neural network, backpropagation net (BPN), self-organizing maps (SOM)): "Can we find a good classification rule for the bank credits based on the information provided?" and "What are the advantages and disadvantages of statistical methods and machine learning methods, and how do they compare in predictive power?"

3. Actual Approaches

3.1 Data Restructuring

In order to satisfy the linear discriminant analysis assumptions, the variables are transformed as follows (a short code sketch is given after this list):

X1 -> ln(X1)
X2 -> ln(X2 + 1)
X3 -> X3^0.5
X5 -> ln(X5)
X6 -> ln(X6 + 1)
X8 -> X8^0.25

The univariate plots used to inspect the distributions and the transformations are presented in Appendix 1.
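A minimal sketch of these transformations in S (assuming, as in Section 3.7, that the raw data have been read into d.loans2 with the eight ratios in columns 3-10; the exact construction of loans2d is not shown in the report, so this is an assumption):

    # Assemble the ratio columns and apply the Section 3.1 transformations.
    loans2d <- data.frame(d.loans2[, 3:10])
    loans2d$x1 <- log(loans2d$x1)         # X1 -> ln(X1)
    loans2d$x2 <- log(loans2d$x2 + 1)     # X2 -> ln(X2 + 1)
    loans2d$x3 <- sqrt(loans2d$x3)        # X3 -> X3^0.5
    loans2d$x5 <- log(loans2d$x5)         # X5 -> ln(X5)
    loans2d$x6 <- log(loans2d$x6 + 1)     # X6 -> ln(X6 + 1)
    loans2d$x8 <- loans2d$x8^0.25         # X8 -> X8^0.25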
3.2 Summary Statistics for the Transformed Measurements

Table 1. Overall summaries of variables for the training and testing group
(training set; n = 68, number of variables = 8)

Variable    Mean  Std.Dev.   Min.  1st Qu.  Median  3rd Qu.   Max.  Skewness  Kurtosis
X1          0.67      0.95  -0.93     0.06    0.50     1.22   3.51      1.09      1.19
X2          0.08      0.11  -0.21     0.01    0.07     0.14   0.36      0.26      0.28
X3          0.34      0.14   0.00     0.26    0.34     0.42   0.65     -0.02      0.04
X4          0.40      0.18   0.02     0.28    0.40     0.53   0.83      0.07     -0.42
X5         -1.64      0.55  -3.00    -1.97   -1.71    -1.20  -0.58     -0.12     -0.13
X6          0.04      0.09  -0.17     0.00    0.00     0.10   0.47      1.61      5.84
X7          0.47      0.24   0.00     0.36    0.52     0.64   0.82     -0.75     -0.31
X8          0.91      0.19   0.38     0.79    0.90     1.03   1.47      0.15      0.79

Table 2. Variable summary statistics by group (training set)

            Group 1 (n = 34)                 Group 2 (n = 34)
Variable    Min.   Max.   Mean  Std.Dev.     Min.   Max.   Mean  Std.Dev.
X1          0.07   3.51   1.25      0.93    -0.93   1.39   0.09      0.53
X2          0.01   0.36   0.14      0.09    -0.21   0.20   0.01      0.08
X3          0.00   0.65   0.28      0.14     0.17   0.65   0.40      0.12
X4          0.02   0.83   0.32      0.18     0.28   0.75   0.49      0.12
X5         -2.41  -0.58  -1.47      0.51    -3.00  -0.78  -1.80      0.54
X6         -0.17   0.47   0.06      0.11    -0.11   0.20   0.02      0.07
X7          0.00   0.82   0.46      0.23     0.00   0.79   0.48      0.25
X8          0.38   1.47   0.82      0.21     0.79   1.31   0.99      0.13

3.3 Bivariate Scatterplot

Figure 1. Bivariate scatterplot (red: group 2, green: group 1)

There is a strong positive correlation between X4 and X8. X7 and X8 appear to contribute little to discriminating the two groups, while combinations of X1, X2, and the other variables provide good distinction between the two groups. The scatterplot was made in Xgobi.new (Swayne, Cook & Buja, 1998).

3.4 High-Dimensional Visual Inspection

Neither the grand tour nor 3-d rotation gives a good visual separation of the two groups on this data set.

3.5 Linear Discriminant Analysis (LDA)

Figure 2. LDA result for the training set
Figure 3. LDA result for the testing set (green: group 1, red: group 2)

The linear discriminant analysis solution is given by A = eigenvectors of W^(-1)B, where W and B are the within-group and between-group covariance matrices. X1bar and X2bar are the mean vectors of group 1 and group 2, respectively, and Xbar is the overall mean. Since the prior probabilities are the same for both groups, they are omitted.

X1bar^T = [1.2476 0.1448 0.2838 0.3179 -1.4689 0.0606 0.4597 0.8201]
X2bar^T = [0.0929 0.0076 0.4035 0.4903 -1.8018 0.0231 0.4797 0.9910]
Xbar^T  = [0.6703 0.0762 0.3437 0.4041 -1.6354 0.0419 0.4697 0.9056]
A^T     = [-0.9500600 -11.1621542 -5.8750227 10.1964130 0.6896711 1.2087291 -2.8772790 -7.2737290]

The classification rule is that X0 belongs to group 1 if

(X1bar - Xbar)^T A A^T (X0 - Xbar) - (1/2)(X1bar - Xbar)^T A A^T (X1bar - Xbar)
    >= (X2bar - Xbar)^T A A^T (X0 - Xbar) - (1/2)(X2bar - Xbar)^T A A^T (X2bar - Xbar),

and X0 belongs to group 2 otherwise, where X0 is the case to be classified.

Linear discriminant analysis misclassifies 8/68, or 12 percent, of the observations in the training set, which is not very satisfactory. Using this classification rule on the testing set (the next year's data), we obtain the same misclassification error rate: four cases in each group are misclassified. Figures 2 and 3 show the corresponding LDA results.

3.6 Classification and Regression Trees (CART)

Figure 4. Classification tree

Classification tree:
tree(formula = loans2.group ~ ., data = loans2d)
Variables actually used in tree construction:
[1] "x1" "x6" "x4" "x2"
Number of terminal nodes:  7
Residual mean deviance:  0.4568 = 27.86 / 61
Misclassification error rate:  0.1029 = 7 / 68

Only x1, x2, x4, and x6 are actually used in the CART classification. The misclassification error rate is 10 percent, which is a little better than LDA. The CART classification rule is the following: if X1 < 0.06, or 0.06 < X1 < 0.32 and X6 < 0.00990131, or X1 > 0.32 and X4 > 0.3 and 0.0615694 < X2 < 0.135395, then the case is in group 2 (bad credit); otherwise, the case is in group 1. (A small S function encoding this rule is sketched below.)
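To make the rule easy to apply, here is a small S function that encodes it directly; the name credit.rule is ours (hypothetical), and the thresholds are exactly those printed by the fitted tree:

    # Hypothetical helper encoding the CART rule above: returns 2 (bad credit)
    # when a case falls in one of the "bad" leaves, and 1 (good credit) otherwise.
    credit.rule <- function(x1, x2, x4, x6) {
        bad <- (x1 < 0.06) |
               (x1 >= 0.06 & x1 < 0.32 & x6 < 0.00990131) |
               (x1 >= 0.32 & x4 > 0.3 & x2 > 0.0615694 & x2 < 0.135395)
        ifelse(bad, 2, 1)
    }
    # e.g. predicted <- credit.rule(loans2d$x1, loans2d$x2, loans2d$x4, loans2d$x6)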
When the third year's data were applied as a testing set, four points are clearly misclassified. Eight other observations are given equal weight for group 1 and group 2: observations 5, 8, 28, 32, 47, 61, 64, and 68. Further examination of these eight cases is required.

> predict
     1  2
  1 28  2
  2  2 28

3.7 Feed-Forward Neural Network

We use an 8-3-1 network with skip-layer connections (39 weights), linear output units, and weight decay of 0.001 for this data set. The prediction function is stored in loans2.nn. (An approximate call for this fit is sketched at the end of this subsection.) The feed-forward neural network gives an almost perfect result; only one case is misclassified. The classification boundary is probably highly non-linear: visual inspection did not show any possibility of a perfect linear classification. When loans2.nn is used to classify the next year's data set, it does an excellent job too, which may be due to the similarity of the two data sets.

> table(d.loans2[,2], round(predict(loans2.nn, loans2d)))
     1  2
  1 33  1
  2  0 34
> d.loans3 _ read.table("loans3.dat",
      col.names = c("ob","group","x1","x2","x3","x4","x5","x6","x7","x8"))
> loans3d _ data.frame(d.loans3[,3:10])
> table(d.loans3[,2], round(predict(loans2.nn, loans3d)))
     1  2
  1 33  1
  2  0 34
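The exact call that produced loans2.nn is not reproduced in this report; a rough sketch of it, using nnet() from the Venables & Ripley software and the settings described above (3 hidden units, skip-layer connections, linear output units, decay 0.001), would look like this:

    library(nnet)
    # group is coded 1/2 as in d.loans2[, 2]; linout = TRUE gives linear output units
    loans2.nn <- nnet(group ~ ., data = data.frame(group = d.loans2[, 2], loans2d),
                      size = 3, skip = TRUE, linout = TRUE, decay = 0.001, maxit = 1000)
    table(d.loans2[, 2], round(predict(loans2.nn, loans2d)))   # training confusion table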
3.8 Back Propagation (BP) Nets

Software called PCNeuron is used for the back propagation and SOM analyses. It is a neural network development shell developed by Professor I-Cheng Yeh at the Chung-Hwa Institute, Taiwan. BP is one of the most popular neural networks in business applications; it is widely used in stock price prediction, bank loan evaluation, bankruptcy prediction, and other classification problems. P. Werbos initiated the early framework of BP in 1974, proposing a method of modifying connection weights for the neurons in the hidden layers. In 1985, D. Parker and, independently, Rumelhart, Hinton, and Williams revealed the concepts and computational procedure of BP. BP uses a simple steepest gradient descent algorithm, searching for minima on a specified error surface using small steps of fixed size (specified by the learning rate). The algorithm is as follows:

1. Determine the network structure, the system's parameters, and the initial connection weights.
2. Select a training pair from the training set.
3. Apply the input vector to the network.
4. Calculate the output of the network (forward pass):
   S_j = sum_i a_i W_ji + a_0 W_j0,  a_j = f(S_j)
   S_k = sum_j a_j W_kj,             a_k = f(S_k)
5. Calculate the error between the calculated output and the desired output (backward pass):
   e_k = (t_k - a_k) f'(S_k)
6. Repeat steps 2-5 for each training pair until the error over the entire set is acceptably low.

For this project, a BP net with one hidden layer is used: 8 input units, 4 hidden units, and 2 output units.

Figure 5. The structure of the BPN

Connecting weights (as reported by PCNeuron):
  Input layer:  [7.360 3.814 8.619 4.325 -8.152 -2.088 14.106 -0.111 14.610 -0.485]
  Hidden layer: [-6.888 12.164]
  Output layer: [-0.795 0.795]

Four out of 68 cases, or 6 percent, are misclassified for both the training and the testing set. This is a fairly good result, although not as good as the feed-forward neural network. When the predictions are put back into the data set, it appears that X1, X2, X5, and X7 contribute most to the discrimination.

3.9 Self-Organizing Map (SOM)

The self-organizing map (SOM) network was developed by Teuvo Kohonen between 1979 and 1982. It can be used to classify items into appropriate categories of similar objects. The SOM net was inspired by the fact that the relative positions of small groups of neurons in the brain reflect some physical relationship to the sensory signals. The SOM net is one of the most popular unsupervised learning nets. It can not only work stand-alone, but also serve as a front end for other networks. The primary use of the SOM is to visualize topologies and hierarchical structures of higher-dimensional input spaces.

The SOM net contains two layers. The first, the input layer, uses a linear transfer function; its input signals can be binary or continuous variables. The second, a competitive layer, is used to represent clusters of the input data. The two layers are fully connected to each other. The SOM net uses the "winner-take-all" strategy to update the connection weights: in each iteration, only the output neuron that wins the competition by being closest to the input vector is activated and allowed to modify its connection weights. The learning procedure works as follows (a minimal code sketch is given at the end of this subsection):

1. Determine the network's parameters and initialize the weights W_kj between input neuron j and output neuron k to small random values.
2. Present a new input vector x and compute the distance between the input vector and each output neuron k:
   Out_k = sum_{j=1}^{n} (x_j - W_kj)^2,  k = 1, 2, ..., c.
3. Select the output neuron k* with the smallest distance: Out_k* = min_k (Out_k).
4. Update the activation values of the output neurons based on the winner-take-all strategy: Y_k = 1 if k = k*, otherwise Y_k = 0.
5. Modify the connection weights for all units k within a specified neighborhood of k*:
   W_kj' = W_kj + eta (x_j - W_kj),  k in NBD_k*(t), 1 <= j <= n,
   where eta is the learning rate.
6. Update the learning rate.
7. Repeat the above steps until there are no more input vectors.

Figure 6. The basic structure of the SOM net

For this data set, 2-by-2, 3-by-3, and 4-by-4 grids are used; the resulting maps are shown in Figure 7. SOM does a poor job here: almost half of the points are misclassified. Although none of the three maps gives a good result, the more output units there are, the better the data structure is described.

Figure 7. SOM results (red: group 2, green: group 1)
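For illustration, the learning procedure above can be sketched in a few lines of S/R. The grid size, initialization, neighborhood (winner only), and learning-rate decay used below are simplifying assumptions and do not reproduce the PCNeuron settings:

    # Minimal winner-take-all SOM sketch (illustrative only).
    som.sketch <- function(X, units = 4, iters = 1000, alpha = 0.5) {
        X <- as.matrix(X)
        W <- X[sample(1:nrow(X), units), ]                  # step 1: initial weight vectors
        for (t in 1:iters) {
            x <- X[sample(1:nrow(X), 1), ]                  # step 2: present an input vector
            d <- apply(W, 1, function(w) sum((x - w)^2))    # distances Out_k
            k <- which.min(d)                               # step 3: winning unit k*
            W[k, ] <- W[k, ] + alpha * (x - W[k, ])         # step 5: move the winner toward x
            alpha <- alpha * 0.995                          # step 6: decay the learning rate
        }
        W                                                   # one prototype per output unit
    }
    # Assign each case to its nearest output unit:
    # W <- som.sketch(loans2d)
    # unit <- apply(loans2d, 1, function(z) which.min(apply(W, 1, function(w) sum((z - w)^2))))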
4. Summary of Findings

1. For the credit evaluation problem, neural network methods generally perform better than the statistical methods. The feed-forward neural net gives a nearly perfect result. Given the highly non-linear boundary obtained by the neural net methods, I recommend that the CART rule and the feed-forward rule be used together to get better prediction results. The CART classification rule is: if X1 < 0.06, or 0.06 < X1 < 0.32 and X6 < 0.00990131, or X1 > 0.32 and X4 > 0.3 and 0.0615694 < X2 < 0.135395, then the case is in group 2 (bad credit); otherwise, the case is in group 1. The discriminant rule for the feed-forward method is generated by the S-Plus function and stored in loans2.nn for classifying new points.

2. The clients' financial credit can be distinguished very well from the combinations of their asset and liability information provided in this data set. Since the cost of classifying a customer with good credit into the bad-credit group is much lower than that of classifying a customer with bad credit into the good-credit group, the sheer number or rate of misclassifications should not be the only criterion for judging different classification rules; the direction of misclassification matters in this situation.

3. Among the measures, X1, X2, X4, and X6 contribute most to classifying the two groups.

4. Several points (35, 39, 47, 49, and 52) have much higher asset/liability ratios. Data inspection shows that they are not outliers; they are valid data points corresponding to conservative customers with low risk.

References

Kohonen, T. (1990). "The Self-Organizing Map." Proceedings of the IEEE 78(9): 1471-1480.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U.K.

Venables, W. N. and Ripley, B. D. (1994). Modern Applied Statistics with S-Plus. Springer-Verlag, New York.