Data Mining for Credit Worthiness

DPS 2014 Team 1: Amer Ali, Gary Fisher, Hilda Mackin
Fall 2012 DCS860A: Emerging Topics in Computer Security
Data Mining Project Report
Abstract
We evaluate loan applicant credit worthiness by learning from historical data with data mining
algorithms and tools.
Description
People apply for loans, and banks or lenders need to determine the applicants' “credit worthiness”
on an “A, B, C, no credit” scale (where A is good and worthy, B is more of a risk, C is risky, and no
credit is not worthy of a loan). There are 100 attributes to consider. This is a difficult problem for many
loan officers because there are many possible combinations of attributes. [book]
We used the Weka [weka] data mining tool to evaluate loan applicant credit worthiness by learning
from historical data. We tried several data mining algorithms [book] [algo] as well as iterative training
[book]. We then evaluated how well the tool's predicted credit scores matched the actual
credit scores.
Data and Methodology
We used data from the “easy data mining” website [data], which offers data in many categories for data
mining testers. The data we chose was for determining the credit worthiness of loan applicants
based on a large number of attributes.
We converted the data, as shown in Figure 1: Source data with credit scores, into a format for the Weka
tool.
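For those who want to script this conversion step, Weka ships converter classes that do the same CSV-to-ARFF translation as the Explorer; below is a minimal sketch using Weka's Java API (the file names are placeholders for our data files):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the raw CSV export (placeholder file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("creditworthiness.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out in ARFF format for Weka.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("creditworthiness.arff"));
        saver.writeBatch();
    }
}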
We used the data with the credit scores as training data. To train the tool, we ran classifiers on the
training data to show us the “confusion matrix” (basically, where the mistakes were made), as shown in
Figure 3: The Confusion Matrix from evaluating training data, and (when the algorithm created one) the
decision tree, as shown in Figure 4: Visualizing the decision tree.
We removed the credit scores from the test data (but kept them aside). After using the data mining tool
to assign credit scores based on the applicant data in the test file, as shown in Figure 2: Running
a test with training data, and then massaging the output, as shown in Figure 5: Capturing the Test Data,
we compared the tool's credit scores with the actual credit scores, so that we could determine the
percentage of evaluations the tool got correct, as shown in Figure 6: Determining the
success rate of the instances.
We used different algorithms to evaluate the test data. The results are shown in “Observations” below.
For some of the algorithms, we were able to see the decision trees that were generated.
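The whole train-and-evaluate loop described above can also be reproduced with Weka's Java API rather than the Explorer GUI; the following sketch (using our file names from Appendix B, and assuming the credit score is the last attribute) trains J48 on the training file, evaluates it against the supplied test set, and prints the confusion matrix and the percentage of correctly classified instances:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateJ48 {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("100Training.arff");
        Instances test = DataSource.read("2000Testing.arff");
        // Assumption: the class attribute (the credit score) is the last column.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();               // Weka's C4.5 implementation
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toMatrixString());        // the confusion matrix
        System.out.println(eval.pctCorrect() + "% correct");
    }
}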
Tools/Models
We used the Weka tool with the Naïve Bayes, J48, IB1, and Ordinal Class classifiers. [book] [help]
As explained below, of these classification algorithms we found J48 [algo] to be the best model for
our prediction. This model is an open-source Java implementation of the C4.5 algorithm. Below is a short
description of the algorithm and its pseudocode:
Algorithm
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information
entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i = {x_1,
x_2, ...} is a vector where x_1, x_2, ... represent attributes or features of the sample. The training data is
augmented with a vector C = {c_1, c_2, ...} where c_1, c_2, ... represent the class to which each sample
belongs.
At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of
samples into subsets enriched in one class or the other. Its criterion is the normalized information gain
(difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the
highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the
smaller sublists.
This algorithm has a few base cases:
- All the samples in the list belong to the same class. When this happens, it simply creates a leaf node
for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher
up the tree using the expected value of the class.
- Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the
tree using the expected value.
Pseudocode
In pseudocode, the general algorithm for building decision trees is [algo]:
1. Check for base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
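To make the split criterion in step 2 concrete, here is a small self-contained sketch (our own illustration, not taken from Weka) of the entropy and normalized-information-gain computation for a single nominal attribute; C4.5's “normalized information gain” is the information gain divided by the entropy of the split itself:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GainRatio {

    // Shannon entropy (base 2) of a list of class labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Normalized information gain of splitting 'labels' by attribute 'values'.
    static double gainRatio(List<String> labels, List<String> values) {
        double n = labels.size();
        // Partition the class labels by attribute value.
        Map<String, List<String>> parts = new HashMap<>();
        for (int i = 0; i < labels.size(); i++)
            parts.computeIfAbsent(values.get(i), k -> new ArrayList<>()).add(labels.get(i));

        double remainder = 0.0, splitInfo = 0.0;
        for (List<String> part : parts.values()) {
            double w = part.size() / n;
            remainder += w * entropy(part);               // entropy left after the split
            splitInfo -= w * (Math.log(w) / Math.log(2)); // entropy of the split itself
        }
        double gain = entropy(labels) - remainder;
        return splitInfo == 0 ? 0 : gain / splitInfo;
    }

    public static void main(String[] args) {
        // Toy example: credit scores split by a hypothetical yes/no attribute.
        List<String> scores = Arrays.asList("A", "A", "B", "C");
        List<String> owns = Arrays.asList("yes", "yes", "no", "no");
        System.out.println(gainRatio(scores, owns)); // prints 1.0
    }
}

The attribute with the highest such value would be chosen as a_best in step 3.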
Observations
Training/Test records, algorithms
Table 1: Observations

Training records | Test records | Naïve Bayes | J48 | IB1 | Ordinal Class
50               | 2000         | 33%         | 43% | 34% | 33%
100              | 2000         | 47%         | 54% | 27% | 35%
200              | 2000         | 37%         | 35% | 30% | 40%
500              | 2000         | 44%         | 40% | -   | -
Refined:
200              | 300          | 37%         | 48% |     |
300trimmed       | 2000         | -           | 59% |     |
The J48 algorithm seemed to work out the best, as shown in Table 1: Observations.
We also “retrained”, or iteratively trained [book] [help]. That is, we trained on 200 records, tested 300
records, and got 48% correct. Then we trimmed the 300-record output, keeping only those records that
we predicted correctly (about 144 records). We then used the trimmed file as training for the
next 2000 records and got 59% correct, increasing the success rate by 11 percentage points. We found
that the J48 algorithm also retrained the best; Naïve Bayes did not improve at all in our test. We did not
want to retrain too many times or we would over-fit the model.
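For reference, the trimming step of this retraining loop can also be scripted; the sketch below assumes the predictions have already been merged into an ARFF file that carries both the actual and the predicted credit score (the file name and attribute positions are placeholders that depend on how the classifier-errors file was saved):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class TrimCorrect {
    public static void main(String[] args) throws Exception {
        Instances results = DataSource.read("300TestingResults_J48.arff"); // placeholder name
        int actualIdx = results.numAttributes() - 2;    // assumption: actual score column
        int predictedIdx = results.numAttributes() - 1; // assumption: predicted score column

        // Empty copy with the same header, then keep only the correct predictions.
        Instances trimmed = new Instances(results, 0);
        for (int i = 0; i < results.numInstances(); i++) {
            if (results.instance(i).value(actualIdx) == results.instance(i).value(predictedIdx)) {
                trimmed.add(results.instance(i));
            }
        }

        // Save the trimmed set as the next round's training file.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(trimmed);
        saver.setFile(new File("300Trimmed.arff"));
        saver.writeBatch();
    }
}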
As shown in Figure 6: Determining the success rate of the instances, when we used 50 examples to train
for 2000 tests, we were able to correctly identify the credit rating 43% of the time.
We learned a great deal about the data mining algorithms and the “feel” of data mining (that is,
successes, iteration, pruning, and other techniques). The Weka tool is very useful, but a lot of manual
effort was required to massage the data, often necessitating conversion to different data formats to
allow us to edit or modify the data.
Summary
We used the Weka tool to apply data mining techniques in search of better, simpler procedures for
determining the “credit worthiness” of loan applicants. We compared Weka's results against the
known results to determine our success rate. We tried different classification algorithms. We
“retrained” by using the revised results from previous tests to try to improve the success rate. Some
algorithms did not appear to be very accurate. Also, surprisingly, the algorithms did not seem to
improve much with more training data. We feel that the J48 algorithm, with the highest training
accuracy and retraining success, worked out the best.
References
[algo] Detailed description of the J48 (C4.5) algorithm: http://en.wikipedia.org/wiki/C4.5_algorithm
[book] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, ISBN-10: 0123748569
[data] Easy.Data.Mining at http://www.easydatamining.com/index.php?option=com_content&view=article&id=22&Itemid=90&lang=en
[weka] Weka Data Mining Tool at http://www.cs.waikato.ac.nz/ml/weka/
[help] Weka-supplied documentation, installed at wekapgmdir/Weka-3-6/documentation.html
Appendix A: Screen captures of the Weka Data Mining process
Figure 1: Source data with credit scores
Figure 2: Running a test with training data
Figure 3: The Confusion Matrix from evaluating training data
Figure 4: Visualizing the decision tree
Figure 5: Capturing the Test Data
Figure 6: Determining the success rate of the instances
Appendix B: Procedure
1. Process the “.csv” file
a. Start Weka
b. Select “Explorer”
c. Select “Open file…”
i. Select file type “CSV data files”
ii. Select training file, e.g. 100Training.arff
iii. Select “open”
d. Process the fields
i. Select the check boxes for
1. “our company”
2. “our copyright”
3. “our product”
4. “our URL”
5. “do not remove”
ii. Keep “row” and all other fields
iii. Select “Remove”
e. Select “Save…”
i. Remove the “.csv” from the filename
ii. Keep “.arff” as the filetype
iii. Select “Save”
f. You might need to copy the @attribute lines from another file if you get an
“incompatible” message when processing the training file
2. Run the tests
a. Start Weka
b. Select “Explorer”
c. Select “Open file…”
i. Select training file, e.g. 100Training.arff
d. Select “Classify” tab
e. Select “Choose”
i. Select the test:
1. Bayes->NaiveBayes
2. Lazy->IB1
3. Meta->OrdinalClassClassifier
4. Trees->J48
f. Select “Supplied Test Set”
g. Select “Set…”
i. Select “Open file…”
1. Select the Test file, e.g. 2000Testing.arff
2. Select “Open”
ii. Select “Close”
h. Select “Start”
i. Wait for the bird (bottom right corner of the Weka page) to stop moving
j. Right click the last entry in “result list”
k. Select “View Classifier Errors”
i. Select “Save”
1. Give name, e.g. 100TestingResults_NaiveBayes.arff
2. Select “Save” which will close “view” window
ii. Select “X” to close “Visualize…” window
l. Select the next test and iterate steps e through k above
3. Process the results
a. Open Weka Explorer
b. Select “Preprocess” Tab
c. Select “Open file…” to get the results.arff file we just created
i. Select the results.arff file
ii. Select “Open”
iii. Select “Save…”
1. Change the name, remove the “.arff”
2. Select a “.csv” file type
3. Select “Save”
d. Iterate the steps above for all tests
4. Select “X” to Exit Weka
5. Process the output
1. Open the results.csv file (in Excel)
2. Go to the last column, probably “CT”
a. Enter in “CT2”:
=IF(VLOOKUP(A2,Creditworthiness.csv!$A$2:$CX$2501,102,TRUE)=CR2,1,0)
b. Copy that from CT2 to CT1692 (where 1692 is the last row of data)
c. In CT1, enter:
=SUM(CT2:CT1692)
d. In CU1, enter:
=ROWS(CT2:CT1692)
e. In CV2, enter:
=CT1/CU1
That gives the successful classification rate for this classifier rule
3. Save the file
a. Select “X”
b. Select “Save” to “Do you want to save the changes…”
c. Select “Yes” to “…may contain features…”
4. Change the name to add the percentage
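As an aside, most of the point-and-click steps above can also be driven from the command line. Assuming weka.jar is on the classpath, a single invocation such as

java -cp weka.jar weka.classifiers.trees.J48 -t 100Training.arff -T 2000Testing.arff

trains on the -t file, evaluates on the -T file, and prints the accuracy and confusion matrix directly, which avoids the spreadsheet arithmetic in step 5 when only the overall success rate is needed.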