DPS 2014 Team 1: Amer Ali, Gary Fisher, Hilda Mackin

Project Report: Data Mining for Credit Worthiness

Abstract

We evaluated loan applicant credit worthiness by learning from historical data with data mining algorithms and tools.

Description

People apply for loans, and banks or lenders need to determine each applicant's "credit worthiness" on an "A, B, C, no credit" scale (where A is good and worthy, B is more of a risk, C is risky, and "no credit" is not worthy of a loan). There are 100 attributes to consider, and the many possible combinations of attributes make this a difficult problem for many loan officers. [book]

We used the Weka [weka] data mining tool to evaluate loan applicant credit worthiness by learning from historical data. We tried several data mining algorithms [book] [algo] as well as iterative training [book]. We then evaluated how well the tool's guessed credit scores matched the actual credit scores.

Data and Methodology

We used data from the "easy data mining" website [data], which offers data in many categories for data mining testers. The data we chose was intended for determining the credit worthiness of loan applicants based on a large number of attributes.

We converted the data, as shown in Figure 1: Source data with credit scores, into a format for the Weka tool. We used the data with the credit scores as training data. To train the tool, we ran classifiers on the training data and examined the "confusion matrix" (basically, where the mistakes were made), as shown in Figure 3: The Confusion Matrix from evaluating training data, and (when the algorithm created one) the decision tree, as shown in Figure 4: Visualizing the decision tree.

We removed the credit scores (but kept them aside) from the test data. After using the data mining tool to assign credit scores based on the applicant data in the test file, as shown in Figure 2: Running a test with training data, and then massaging the output, as shown in Figure 5: Capturing the Test Data, we compared the tool's credit scores with the actual credit scores so that we could determine the percentage of correct evaluations made by the tool, as shown in Figure 6: Determining the success rate of the instances.

We used different algorithms to evaluate the test data. The results are shown in "Observations" below. For some of the algorithms, we were able to see the decision trees that were generated.

Tools/Models

We used the Weka tool with the Naïve Bayes, J48, IB1, and Ordinal Class classifiers. [book] [help] As explained below, of these classification algorithms we found J48 [algo] to be the best model for our prediction. J48 is an open-source Java implementation of the C4.5 algorithm. Below is a short description of the algorithm and its pseudocode.

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already-classified samples. Each sample s_i = {x_1, x_2, ...} is a vector where x_1, x_2, ... represent attributes or features of the sample. The training data is augmented with a vector C = {c_1, c_2, ...} where c_1, c_2, ... represent the class to which each sample belongs.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.

The algorithm has a few base cases:

1. All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree saying to choose that class.
2. None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
3. An instance of a previously-unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Pseudocode

In pseudocode, the general algorithm for building decision trees is [algo]:

1. Check for base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
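To make the splitting criterion concrete, below is a minimal sketch in Java of computing entropy and the normalized information gain (gain ratio) for a candidate split. This is our own illustration, not Weka's source code; the class-count representation and method names are assumptions made for the example.

// Minimal sketch of C4.5's splitting criterion (illustrative only, not Weka source).
// counts[i] is the number of samples of class i in a (sub)set.
public class InfoGain {

    // Entropy, in bits, of a class distribution.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain: entropy of the parent set minus the size-weighted
    // entropy of the subsets produced by the split.
    static double informationGain(int[] parent, int[][] subsets) {
        int total = 0;
        for (int c : parent) total += c;
        double remainder = 0.0;
        for (int[] subset : subsets) {
            int size = 0;
            for (int c : subset) size += c;
            remainder += ((double) size / total) * entropy(subset);
        }
        return entropy(parent) - remainder;
    }

    // C4.5 normalizes the gain by the entropy of the split itself ("gain ratio"),
    // which penalizes attributes that fragment the data into many tiny subsets.
    static double gainRatio(int[] parent, int[][] subsets) {
        int total = 0;
        for (int c : parent) total += c;
        double splitInfo = 0.0;
        for (int[] subset : subsets) {
            int size = 0;
            for (int c : subset) size += c;
            if (size == 0) continue;
            double p = (double) size / total;
            splitInfo -= p * (Math.log(p) / Math.log(2));
        }
        return splitInfo == 0.0 ? 0.0 : informationGain(parent, subsets) / splitInfo;
    }

    public static void main(String[] args) {
        // Toy example: 14 samples (9 class A, 5 class C) split into two subsets.
        int[] parent = {9, 5};
        int[][] split = {{6, 1}, {3, 4}};
        System.out.printf("gain = %.3f bits, gain ratio = %.3f%n",
                informationGain(parent, split), gainRatio(parent, split));
    }
}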
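For completeness, the train-and-evaluate workflow we ran through the Weka Explorer GUI (see Appendix B) can also be scripted against Weka's Java API. The following is a sketch under assumptions: the file names are placeholders, and we assume the class attribute (the credit score) is the last column of the ARFF files.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CreditEval {
    public static void main(String[] args) throws Exception {
        // Load the training and test sets (file names are placeholders).
        Instances train = DataSource.read("100Training.arff");
        Instances test = DataSource.read("2000Testing.arff");
        // Assume the credit-score class is the last attribute.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Build the J48 (C4.5) decision tree from the training data.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate on the supplied test set, as with "Supplied test set" in the GUI.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}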
Observations

Table 1: Observations (percent of test records classified correctly; column headings show training records / test records; the last two columns are the refined, iteratively trained runs)

Algorithm        50/2000   100/2000   200/2000   500/2000   Refined 200/300   300 trimmed/2000
Naïve Bayes      33%       47%        37%        44%        37%               -
J48              43%       54%        35%        40%        48%               59%
IB1              34%       27%        30%        -          -                 -
Ordinal Class    33%       35%        40%        -          -                 -

The J48 algorithm seemed to work out the best, as shown in Table 1: Observations.

We also "retrained", or trained iteratively [book] [help]. That is, we trained on 200 records, tested 300 records, and got 48% correct. We then trimmed the 300-record output, keeping only the records we had predicted correctly (about 144 records). Using the trimmed file as training for the next 2000 test records, we got 59% correct, increasing the success rate by 11 percentage points. We found that J48 also retrained the best; Naïve Bayes did not improve at all in our test. We did not want to retrain too many times, or we would over-fit the model.
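Below is a minimal sketch of this trimming step using Weka's Java API (in practice we performed it through the GUI and by hand-editing the output); the file names are placeholders.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;

public class TrimCorrect {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("200Training.arff");  // placeholder names
        Instances test = DataSource.read("300Testing.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);

        // Keep only the test records the model classified correctly...
        Instances trimmed = new Instances(test, 0); // empty set with the same header
        for (int i = 0; i < test.numInstances(); i++) {
            double predicted = tree.classifyInstance(test.instance(i));
            if (predicted == test.instance(i).classValue()) {
                trimmed.add(test.instance(i));
            }
        }
        // ...and save them as the training file for the next round.
        DataSink.write("300Trimmed.arff", trimmed);
    }
}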
As shown in Figure 6: Determining the success rate of the instances, when we used 50 examples to train for 2000 tests, we were able to identify the credit rating correctly 43% of the time.

We learned a great deal about the data mining algorithms and the "feel" of data mining (that is, successes, iteration, pruning, and other techniques). The Weka tool is very useful, but a lot of manual effort was required to massage the data, often necessitating conversion to different data formats so that we could edit or modify the data.

Summary

We used the Weka tool and data mining techniques to seek out better, simpler procedures for determining the "credit worthiness" of loan applicants. We compared Weka's results with the known results to determine our success rate. We tried different classification algorithms, and we "retrained" by using the revised results from previous tests to try to improve the success rate. Some algorithms did not appear to be very accurate. Also, surprisingly, the algorithms did not seem to improve much with more training data. We feel that the J48 algorithm, with the highest training accuracy and the best retraining results, worked out the best.

References

[algo] Detailed description of the J48/C4.5 algorithm: http://en.wikipedia.org/wiki/C4.5_algorithm
[book] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, ISBN-10: 0123748569
[data] Easy.Data.Mining at http://www.easydatamining.com/index.php?option=com_content&view=article&id=22&Itemid=90&lang=en
[weka] Weka Data Mining Tool at http://www.cs.waikato.ac.nz/ml/weka/
[help] Weka-supplied documentation, installed at wekapgmdir/Weka-3-6/documentation.html

Appendix A: Screen captures of the Weka Data Mining process

Figure 1: Source data with credit scores
Figure 2: Running a test with training data
Figure 3: The Confusion Matrix from evaluating training data
Figure 4: Visualizing the decision tree
Figure 5: Capturing the Test Data
Figure 6: Determining the success rate of the instances

Appendix B: Procedure

1. Process the ".csv" file
   a. Start Weka
   b. Select "Explorer"
   c. Select "Open file…"
      i. Select file type "CSV data files"
      ii. Select the training file, e.g. 100Training.csv
      iii. Select "Open"
   d. Process the fields
      i. Select the check boxes for
         1. "our company"
         2. "our copyright"
         3. "our product"
         4. "our URL"
         5. "do not remove"
      ii. Keep "row" and all other fields
      iii. Select "Remove"
   e. Select "Save…"
      i. Remove the ".csv" from the filename
      ii. Keep ".arff" as the file type
      iii. Select "Save"
   f. You might need to copy the @attribute lines from another file if you get an "incompatible" message when processing the training file
2. Run the tests
   a. Start Weka
   b. Select "Explorer"
   c. Select "Open file…"
      i. Select the training file, e.g. 100Training.arff
   d. Select the "Classify" tab
   e. Select "Choose"
      i. Select the classifier:
         1. Bayes->NaiveBayes
         2. Lazy->IB1
         3. Meta->OrdinalClassClassifier
         4. Trees->J48
   f. Select "Supplied test set"
   g. Select "Set…"
      i. Select "Open file…"
         1. Select the test file, e.g. 2000Testing.arff
         2. Select "Open"
      ii. Select "Close"
   h. Select "Start"
   i. Wait for the bird (bottom right corner of the Weka window) to stop moving
   j. Right-click the last entry in the "Result list"
   k. Select "Visualize classifier errors"
      i. Select "Save"
         1. Give a name, e.g. 100TestingResults_NaiveBayes.arff
         2. Select "Save", which closes the view window
      ii. Select "X" to close the "Visualize…" window
   l. Select the next classifier and repeat steps e through k above
3. Process the results
   a. Open Weka Explorer
   b. Select the "Preprocess" tab
   c. Select "Open file…" to get the results .arff file we just created
      i. Select the results .arff file
      ii. Select "Open"
      iii. Select "Save…"
         1. Change the name, removing the ".arff"
         2. Select the ".csv" file type
         3. Select "Save"
   d. Repeat the steps above for all test results
4. Select "X" to exit Weka
5. Process the output
   a. Open the results .csv file (in Excel)
   b. Go to the last column, probably "CT"
      i. In CT2, enter: =IF(VLOOKUP(A2,Creditworthiness.csv!$A$2:$CX$2501,102,TRUE)=CR2,1,0)
      ii. Copy that from CT2 down to CT1692 (where 1692 is the last row of data)
      iii. In CT1, enter: =SUM(CT2:CT1692)
      iv. In CU1, enter: =ROWS(CT2:CT1692)
      v. In CV2, enter: =CT1/CU1
      This gives the successful classification rate for this classifier (a scripted equivalent is sketched after this procedure)
   c. Save the file
      i. Select "X"
      ii. Select "Save" in response to "Do you want to save the changes…"
      iii. Select "Yes" in response to "…may contain features…"
   d. Change the file name to add the percentage
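As an alternative to the spreadsheet formulas in step 5, the same success-rate computation can be scripted. The sketch below is hypothetical: it assumes simple comma-separated files with no quoted commas, ids in the first column, the actual credit score in spreadsheet column CX (zero-based index 101), and the predicted score in column CR (index 95), matching the VLOOKUP formula above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the spreadsheet formulas in step 5: look up each
// test row's actual credit score by its id, compare it with the predicted
// score, and report the fraction that match.
public class SuccessRate {
    public static void main(String[] args) throws IOException {
        // Actual scores, keyed by id (assumed: id in column 0, score in column 101).
        Map<String, String> actual = new HashMap<>();
        List<String> master = Files.readAllLines(Paths.get("Creditworthiness.csv"));
        for (String line : master.subList(1, master.size())) { // skip the header row
            String[] f = line.split(",");
            actual.put(f[0], f[101]);
        }

        // Predicted scores (assumed: id in column 0, prediction in column 95).
        List<String> results = Files.readAllLines(Paths.get("results.csv"));
        int correct = 0, total = 0;
        for (String line : results.subList(1, results.size())) {
            String[] f = line.split(",");
            if (f[95].equals(actual.get(f[0]))) correct++;
            total++;
        }
        System.out.printf("Success rate: %d/%d = %.1f%%%n",
                correct, total, 100.0 * correct / total);
    }
}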