
Applications of Machine Learning: Consumer
Credit Risk Analysis
by
Danny Yuan
B.S. Electrical Engineering and Computer Science and B.S.
Mathematics, Massachusetts Institute of Technology (2013)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
Copyright © 2015 Massachusetts Institute of Technology, All rights
reserved.
The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and
electronic copies of this thesis document in whole and in part in any medium now known or
hereafter created.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 14, 2015
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Andrew W. Lo
Charles E. and Susan T. Harris Professor; Professor of Electrical
Engineering and Computer Science; Professor, Sloan School of Management
Thesis Supervisor
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
John Guttag
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Professor Albert R. Meyer
Chairman, Master of Engineering Thesis Committee
Applications of Machine Learning: Consumer Credit Risk
Analysis
by
Danny Yuan
Submitted to the Department of Electrical Engineering and Computer Science
on May 14, 2015, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Current credit bureau analytics, such as credit scores, are based on slowly varying
consumer characteristics and thus do not adapt to changes in customer behavior and
market conditions over time. In this paper, we apply machine-learning techniques to
construct forecasting models of consumer credit risk. By aggregating credit accounts,
credit bureau, and customer data given to us by a major commercial bank (which we
will call the Bank, as per our confidentiality agreement), we construct out-of-sample
forecasts. The resulting models can tackle common challenges faced by chief risk
officers and policymakers, such as deciding when and by how much to cut individuals'
credit lines, evaluating the credit scores of current and prospective customers, and
forecasting aggregate consumer credit defaults and delinquencies for the purpose of
enterprise-wide and macroprudential risk management.
Thesis Supervisor: Andrew W. Lo
Title: Charles E. and Susan T. Harris Professor; Professor of Electrical Engineering
and Computer Science; Professor, Sloan School of Management
Thesis Supervisor: John Guttag
Title: Professor of Electrical Engineering and Computer Science
Acknowledgements
Over the years I have been very fortunate to attend MIT and to have the pleasure of
working with some great people. I would like to express my sincere gratitude to the
following people.
First and foremost, I want to give a big thanks to BigData@CSail for funding this
research project and providing me with a research assistantship. My education would
not have been complete and this paper would not have been possible without their
generous support.
Next, I would like to thank my supervisors, Andrew Lo and John Guttag, for giving
me such an engaging and interesting topic to work on. This project has beautiful
elements from both artificial intelligence, the field of concentration for my Master's
degree, and finance, my career interest. My supervisors are more knowledgeable in
their respective areas of expertise than anyone I know, and they gave me insights into
the intersection of the two fields that I would not have gotten elsewhere.
I would also like to thank David Fagnan, a graduate student who has been supervising
me on this project, for his patience and for sharing his vast knowledge of consumer
credit risk. There were many hurdles over the course of this work; without his advice
and words of encouragement, the project would not have been possible.
Finally, I would like to thank my parents and my sister for the sacrifices they made
along the way to ensure the best education I could ever have. At times, I have
certainly been so absorbed with academics and work that I neglected other parts of
my life. Thank you for always being there for me.
Contents

Cover page
Abstract
Acknowledgements
Contents
List of Figures
1 Introduction to Consumer Credit Risk Modeling
  1.1 Introduction
  1.2 Credit Scoring
  1.3 Defining the Prediction Problem based on Datasets
  1.4 Outline of this Paper
2 Data Preparation
  2.1 Data Cleaning
  2.2 Splitting into Training and Testing sets
3 Feature Selection
  3.1 Methods for Selection
    3.1.1 Selection via Random Forest
    3.1.2 Selection via Logistic Regression
  3.2 The List of Selected Features
4 Using Machine Learning Algorithms for Credit Scoring
  4.1 Logistic Regression
    4.1.1 Building the Model
    4.1.2 Interpreting the Model
  4.2 Support Vector Machines
    4.2.1 Building the Model
    4.2.2 Evaluating the Model
  4.3 Random Forest
    4.3.1 Decision Trees
    4.3.2 Building Random Forest Models
    4.3.3 Evaluating the Model
5 Closing Remarks
  5.1 Conclusion
  5.2 Further Work
Bibliography
List of Figures

1-1 Distribution of the year in which delinquent customers become delinquent
1-2 Distribution of Statements data by year and month, given to us by the Bank, as of March 2015
2-1 The datasets and their most relevant features
3-1 Importance values returned by Random Forest
3-2 Graph of importance values returned by Random Forest
3-3 Error rate as the number of trees increases
3-4 The model summary returned by R
3-5 Description of selected features
4-1 The training set's distribution of scores returned by logistic regression by positive (delinquent, Dt = 1) and negative (non-delinquent, Dt = 0) examples
4-2 Enrichment and recall returned by logistic regression as a function of threshold for the training set
4-3 Confusion matrix of the resulting classifier on the test set at threshold 0.02
4-4 Evaluation of our model
4-5 The model coefficients
4-6 ROC curves for logistic regression. Training set is represented in red and testing set is represented in green.
4-7 A table showing the number of support vectors for different values of soft margin penalty and class weights
4-8 A table showing the rate of training error for different values of soft margin penalty and class weights
4-9 A table showing the rate of cross-validation error for different values of soft margin penalty and class weights
4-10 A table showing computed precision values for different values of soft margin penalty and class weights
4-11 A table showing computed recall values for different values of soft margin penalty and class weights
4-12 A decision tree to classify delinquent customer accounts by the variable OutstandingCapital
4-13 Accuracy and run time data for CART algorithms as a function of density and tree size
4-14 Relative error for different tree sizes for density = 0.0001
4-15 ROC curve on testing data for densities shown in Figure 4-13. The red line corresponds to density = 0.01 and the green curve corresponds to density = 0.001. The other densities overlap each other.
4-16 ROC curve on training data for various values of number of trees: 1 (black), 10 (red), 100 (green), and 1000 (blue)
4-17 ROC curve on testing data for various values of number of trees: 1 (black), 10 (red), 100 (green), and 1000 (blue)
4-18 ROC curve on training data for various values of minimum node size: 1 (black), 10 (red), 100 (green), and 1000 (blue)
4-19 ROC curve on testing data for various values of minimum node size: 1 (black), 10 (red), 100 (green), and 1000 (blue)
4-20 AUC values on training sets for different values of number of trees and minimum node size
4-21 AUC values on testing sets for different values of number of trees and minimum node size
Chapter 1

Introduction to Consumer Credit Risk Modeling
This chapter gives an overview of the consumer credit risk markets and why it is an
important topic to study for purposes of risk management. We introduce the data
we used for this paper and the prediction problem we are trying to solve. Finally, we
describe how we reformatted the data and how we split the data into training and
testing sets.
1.1 Introduction

One of the most important considerations in understanding macroprudential and
systemic risk is delinquencies and defaults on consumer credit. In 2013, consumer
spending accounted for approximately 69 percent of U.S. gross domestic product. Of
the $3.098 trillion of outstanding consumer credit in the United States in the last
quarter of 2013, revolving credit accounted for over 25 percent ($857.6 billion). A
1% increase in the accuracy of identifying high-risk loans could prevent over $8
billion in losses. Because of the risks inherent in such a large portion of our economy,
building models of consumer spending behavior in order to limit risk exposures in
this sector is becoming more important.
Current models of consumer credit risk are usually customized to the needs of
the lending institutions and credit bureaus. Credit bureaus collect consumer credit
data in order to compute a score that evaluates the creditworthiness of the customer;
however, these metrics do not adapt quickly to changes in consumer behaviors and
market conditions over time. [KKL10] We propose to use machine-learning techniques
to analyze datasets provided to us by a major commercial bank (which we shall
refer to as the "Bank" as per our confidentiality agreements). Since consumer credit
models are relatively new in the space of machine learning, we plan to compare the
accuracy and performance of a few techniques. There are many choices to make
along the way: how to select features, how to split the data into training and testing
sets, how to regularize parameters to handle overfitting and underfitting, and which
kernel functions to use to handle higher-order non-linearity in the features. We will
explore many of these.
By aggregating proprietary data sets from the Bank, we will train and optimize
different models in order to predict whether a customer will be delinquent in their
payments for the next quarter. We will then test the models on out-of-sample data.
1.2 Credit Scoring
The goal of credit scoring is to help financial institutions decide whether or not to
lend to a customer. The resulting score is usually compared against a threshold:
someone above the threshold is usually considered a creditworthy customer who is
likely to repay the financial obligation, while someone below the threshold is more at
risk and less likely to repay. Such tests can have a huge impact on profitability for
banks; even a small fraction of a percentage point of improvement can have
tremendous impact. [Ken13]
There are two types of credit scoring: application scoring and behavioral scoring.
Application scoring is done at the time of application and estimates the probability
of default over some period. Behavioral scoring is used after credit has been granted,
and generally serves to monitor an individual's probability of default over time.
This paper focuses on the latter; in particular, we will try to monitor the Bank's
customers and create metrics to flag customers that have become
risky over time. [Ken13] Given these new metrics, the Bank could take actions against
flagged customers, such as closing their accounts or suspending further loans.
1.3 Defining the Prediction Problem based on Datasets
We were given five datasets by the Bank, which we shall call Accounts, Banks,
Customers, Records, and Statements. Respectively, they contain roughly 270,000
observations of 130 variables, 84,000 observations of 230 variables, 85,000 observations
of 20 variables, 53,000 observations of 30 variables, and 8,900,000 observations of 40
variables. Accounts contains data on the state of the bank accounts, Banks contains
data provided by the credit bureau from all banks about customers' loans, Customers
contains a set of data about the customers, Records contains data collected at the
time of loan application, and Statements contains information about credit card
transactions, but only from 2013 onward.
The field we are interested in is a boolean variable in the Accounts data set indicating
whether the customer account is delinquent. In Figure 1-1, we can see the
distribution of the year in which the delinquent customers became delinquent on
their payments. We have the most delinquency data in 2013, 2014, and the first
quarter of 2015. Looking further into the Statements data in Figure 1-2, we see that
the amount of Statements data jumped almost 10-fold from April 2014 to May 2014.
This makes us think that the data itself is inconsistent, because of the sharp increase
in the number of transactions between these two months. One possible explanation
for this inconsistency could be a change in the definition of what counts as a
transaction. We decided to throw away the data after March 2014 and do predictions
for each quarter of data we do keep.
The prediction problem we are trying to answer is:
Given the credit card transactions (Statements) data in the previous quarter,
aggregated with data from Accounts, Banks, Customers, and Records, will the
customer account be delinquent in payment for the next quarter?
Figure 1-1: Distribution of the year in which delinquent customers become delinquent
Distribution of Statements Data

Month    2013     2014      2015
1        29995    60431     927933
2        28551    59737     871427
3        33326    65054     196444
4        35441    68008     0
5        38200    675209    0
6        39299    685470    0
7        42962    848955    0
8        44161    848028    0
9        45197    924835    0
10       54067    982490    0
11       53460    915826    0
12       59457    1015323   0

Figure 1-2: Distribution of Statements data by year and month, given to us by the
Bank, as of March 2015
In order to answer this question, we have to prepare our data so that we can run
machine-learning algorithms. Since we are trying to make predictions about the next
quarter based on transactions from the previous quarter, we first have to split the
data sets into quarters and then merge Statements data from one quarter with
Accounts data from the next quarter.
1.4 Outline of this Paper
The following is the outline of this paper:
• Chapter 2 This chapter describes how we prepared the data to be used for machine learning. We will discuss how we dealt with missing values and how we
divided the data into training and testing sets.
• Chapter 3 This chapter describes how we selected the most significant features to
be used for machine learning. We will explore two techniques for doing this: logistic
regression and random forest.
• Chapter 4 This chapter goes over the three machine learning techniques used to
explore our prediction problem. We will discuss the results and interpret the
findings returned by logistic regression, support vector machines, and random
forest.
• Chapter 5 This chapter concludes our findings and discusses next steps toward
building better consumer credit risk models.
Chapter 2
Data Preparation
In this chapter, we talk about how we cleaned the data into a usable format and
how we defined our training and testing sets. We identify some issues we discovered
with the data and how we treat missing values. We also talk about how we manipulate
the data to answer our prediction problem.
2.1 Data Cleaning
In order to answer our prediction question using machine learning, we need to
manipulate the data to fit the input format of the algorithms. To do so, we cleaned
each of the datasets before merging them together. Many of the fields in the data
sets were missing, and we handled them in a special manner. For discrete variables,
we replaced the missing field with a value of 0 and created a new column, FIELD_NA,
a binary dummy variable (0 or 1) indicating whether the original field was missing.
For continuous variables, we replaced the missing value with the average of the
available data and added the same kind of binary dummy variable.
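Concretely, this treatment can be implemented as a small helper. The following is a
minimal sketch in R, assuming a data frame df with a continuous column LoanAmount
and a discrete column Scoring; the column names and the helper itself are
illustrative, not the exact code used for this project.

impute_with_flag <- function(df, col, discrete = FALSE) {
  missing <- is.na(df[[col]])
  # FIELD_NA-style dummy: 1 if the original field was missing, 0 otherwise
  df[[paste0(col, "_NA")]] <- as.integer(missing)
  if (discrete) {
    df[[col]][missing] <- 0  # discrete: replace missing values with 0
  } else {
    # continuous: replace with the mean of the available data
    df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

df <- impute_with_flag(df, "LoanAmount")
df <- impute_with_flag(df, "Scoring", discrete = TRUE)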
In the accounts data set, there are about two dozen different account/product types
(since much of the data is hashed to protect the confidentiality of the customers, we
don't actually know what these types mean). Since we only have statements data
from 2013 onward and want to use it for prediction, we removed all delinquencies
before 2013. After removing these accounts and accounts that are closed, we were left
with about 99,000 workable accounts and roughly 34,000 unique customers. We also
generated new fields such as RemainingPayments (amount left to pay),
DelinquencyCounts (number of times delinquent in the data sets), and
QuarterDelinquent (the quarter in which the account became delinquent).
The customers data set was fairly straightforward. We kept the columns that were
meaningful, for example age, marital status, gender, occupation, education, credit
card marker, mortgage marker, loan marker, and revolving credit marker. Columns
we dropped include minimum subsistence, installment loan marker, and postal code.
We dropped these either because the Bank told us they were irrelevant or because
the data itself was mostly meaningless or uninterpretable.
In the banks dataset, there were many repeated rows, empty columns, and a lot
of NA cells. The most useful data we extracted from this dataset are the number of
inquiries (a row is generated whenever an inquiry is made) and the credit scores of
the customers. Although we had no credit scores for some of the customers, we were
able to handle such observations in the manner described earlier.
For the records dataset, we were missing data on some customers, so we handled it
as described earlier. After cleaning the data, we aggregated all the data sets by
merging them on their customer id field.
Finally, in the statements data, we aggregated the total amounts debited and credited
in each quarter, and we created new columns, CreditCount and DebitCount, which
represent the number of credit and debit transactions.
2.2 Splitting into Training and Testing sets
Since we are trying to build a prediction model for delinquency rates, we first need
data to build the model. We also need data to test whether our model makes correct
predictions on new data. The first set is called the training set; the latter is called
the testing set, also known as the hold-out set. The training set is the data that we
feed into the machine-learning model (logistic regression, support vector machines,
random forest, etc.). The test set is the data we feed into our
trained model to check the accuracy of our models.
Since we are trying to predict delinquency for the next quarter based on statements
data from the previous quarter, we split the statements data into quarters. January,
February, and March are considered quarter 1; April, May, and June are quarter 2;
and so on. We then merged the data in such a way that customer data for quarter 2
and statements data for quarter 1 appear in one row. After this, we created training
and testing sets by adding a new column with random values between 0 and 1 and
picking a threshold. We picked 0.7 because we want roughly 70% of our data in the
training set and the rest in the testing set. We were also careful to make sure each
customer's accounts all fall in either the training set or the testing set. After moving
into the training set all testing-set accounts belonging to customers who already had
accounts there, we ended up with roughly 80% of our data in the training set and
20% in the testing set.
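One way to implement this split so that a customer's accounts never straddle the two
sets is to draw the random value per customer rather than per account. A minimal
sketch, assuming a merged data frame named merged with a CustomerId column
(both names are illustrative):

set.seed(42)  # fix the random draw so the split is reproducible
customers <- unique(merged$CustomerId)
# one uniform draw per customer, so all of a customer's accounts
# fall on the same side of the threshold
draw <- setNames(runif(length(customers)), as.character(customers))
in_train <- draw[as.character(merged$CustomerId)] < 0.7
train <- merged[in_train, ]
test  <- merged[!in_train, ]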
Cleaning the data was probably the most time-consuming part of the project,
because there were a lot of questions we needed to answer before it could be done.
Figuring out what the prediction problem is given the limited amount of data we have,
understanding what each of the fields means through translation of the variables, and
communicating with our client took up most of one semester. However, once that was
out of the way, we were ready to move onto the next phase of the project: feature
selection.
Figure 2-1: The datasets and their most relevant features
Chapter 3
Feature Selection
In this chapter, we explore ways to find features to use for training our machine-
learning models. We have shown the features provided to us in the data set by the
Bank in Figure 2-1. However, many of the features provide little to no help in
predicting whether a customer account will eventually go delinquent. As such, it is
important to find the features that are relevant, both to bring down the run time
of the algorithms and to give us a better understanding of which features are actually
meaningful.
3.1 Methods for Selection
We present two methods widely used in the machine-learning literature for selecting
significant features. Both come from supervised learning, the task of inferring a
function from previously labeled examples; the inferred function can then be used to
predict labels for data in the testing set. Logistic regression is commonly used for
bankruptcy prediction. [FS14] Random forest also provides importance values for
input features. We will use these two methods to do feature selection.
3.1.1 Selection via Random Forest
Random forest is an ensemble learning method for classification that trains a model
by creating a multitude of decision trees and outputting the mode of the classes
returned by the individual trees. The algorithm applies the technique of bootstrap
aggregation, also known as bagging, to tree learners. Given a training set
$X = x_1, \ldots, x_n$ with outputs $Y = y_1, \ldots, y_n$, bagging repeatedly
selects a random sample with replacement from the training set and fits trees to the
samples:

For $b = 1, \ldots, B$:

1. Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.

2. Train a decision tree $f_b$ on $X_b, Y_b$.

After training, predictions for unseen examples are made by averaging the individual
predictions, or by taking the mode in the case of classification trees.
Random forest follows the general scheme of bagging but differs by using a modified
tree-learning algorithm that selects, at each iteration of the learning process, a
random subset of features. We use random forest for feature selection because of its
ability to rank the importance of the variables. To measure the importance of the
variables, we first fitted a random forest to the data. During this step, the
classification error rate for each data point is recorded and averaged over the forest.
To measure the importance of a variable, the values of that predictor are then
permuted and the error rate is recomputed. The importance score of a variable is
computed by first averaging the differences in error rate before and after the
permutation, and then dividing the result by the standard deviation of the
differences. The list of importance values can be seen in Figure 3-1, and in Figure 3-2
we plot them. The most important features appear to be OutstandingCapital,
DebitAmount, and CreditAmount. In Figure 3-3, we plot the error rate of the random
forest as we increase the number of trees/samples. As expected, the error decreases as
we increase the number of trees/samples in our bootstrapping procedure.
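A hedged sketch of this procedure with the randomForest package in R, assuming a
prepared training frame train whose label is a binary factor column Delinquent
(names are illustrative):

library(randomForest)
set.seed(1)
rf <- randomForest(Delinquent ~ ., data = train,
                   ntree = 500,         # number of bagged trees
                   importance = TRUE)   # record permutation importance
importance(rf, type = 1)  # mean decrease in accuracy, scaled by its standard deviation
varImpPlot(rf)            # importance values, as in Figures 3-1 and 3-2
plot(rf)                  # error rate as trees are added, as in Figure 3-3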
Figure 3-1: Importance values returned by Random Forest
Figure 3-2: Graph of importance values returned by Random Forest
Figure 3-3: Error rate as the number of trees increase
3.1.2 Selection via Logistic Regression
Logistic regression is one of the most important generalized linear models. It is most
often used to predict the probability that an instance belongs to a class. In our case,
we want to use logistic regression with lasso regularization to predict the probability
that a customer account will default in the next quarter; the model produces values
restricted to the (0,1) interval by assigning coefficients to numeric and categorical
inputs. By running the summary() command in R, we can see which coefficient values
are statistically significant.
Looking at Figure 3-4, we see that OutstandingCapital, Age, CreditAmount,
DebitAmount, OptimalLoanValue, OptimalLoanTerm, LoanAmount, and CreditCount
are some of the significant features. Two of the Scoring category dummy variables
were also significant. Interestingly, these were all in the top 10 features returned by
random forest. The only one of those which logistic regression did not include is
DebitCount.
3.2 The List of Selected Features
Combining the results returned by random forest and logistic regression, we decided
to take all the significant features, including DebitCount, into our training models. A
brief description of the selected features is listed in Figure 3-5.
Since we are using statements data from the previous quarter to predict delinquency
in the next quarter, it is reassuring to see that CreditAmount, DebitAmount,
CreditCount, and DebitCount are significant features for our prediction problem.
These features were generated by aggregating the data in the statements dataset.
Furthermore, LoanAmount, OutstandingCapital, OptimalLoanTerm, and
OptimalLoanValue are features related to the loan itself, so it is reasonable that they
are included in the selected features. Finally, Age and Scoring were also included:
studies have shown that Age is correlated with income, and Scoring is based on the
creditworthiness of the customer.
Figure 3-4: The model summary returned by R
Selected Features

Feature name         Type           Description
Age                  Categorical    Category of age (6 distinct)
LoanAmount           Integer        Amount of loan
OutstandingCapital   Integer        Amount of outstanding principal
Scoring              Categorical    Hashed scoring string based on model
OptimalLoanTerm      Integer        Number of months for optimal loan term
OptimalLoanValue     Float          Log of amount of optimal loan amount
CreditAmount         Float          Total amount credited in account for quarter
DebitAmount          Float          Total amount debited in account for quarter
CreditCount          Integer        Total number of credit transactions for quarter
DebitCount           Integer        Total number of debit transactions for quarter

Figure 3-5: Description of selected features
Chapter 4

Using Machine Learning Algorithms for Credit Scoring
This chapter explores the results of three machine-learning algorithms: logistic
regression, support vector machines, and random forest. These methods were chosen
for their suitability as solutions to the low-default portfolio problem. In this chapter,
we look at the accuracy of these models by fine-tuning the threshold values on
classifier outputs. The comparative assessment of these classification methods is
subjective and will be influenced by the Bank's preferences regarding precision and
recall. As such, we present a range of values and allow the reader to assess the
tradeoffs.
4.1 Logistic Regression
For our classification problem of finding delinquent customers, logistic regression is
the classic model for binary classification. It is usually the go-to method and offers a
baseline for other machine-learning algorithms. Logistic regression doesn't perform
well with a large number of features or with categorical features that take a large
number of values, but it still predicts well when working with correlated features.
4.1.1 Building the Model
In Figure 4-1, we see the distributions of the scores returned by logistic regression for
the delinquent and non-delinquent examples. Our goal is to use the model to classify
examples in our testing set; therefore, it should assign high scores to delinquent
customers (Dt = 1) and low scores to non-delinquent customers (Dt = 0). Both
distributions are concentrated towards 0, meaning the scores for both classes tend to
be low, but the distribution for the non-delinquent customers decays to 0 faster than
that of the delinquent customers. From this, we can see that the model identifies
subpopulations where the probability of delinquency is higher.
For logistic regression to solve our classification problem, we have to choose a
threshold above which examples are labeled positive. When picking a threshold, we
have to consider the tradeoff between precision and recall. Precision is the fraction of
predicted positive examples that are actually positive. Recall is the fraction of actual
positive examples our model finds. From Figure 4-2, we see that a high threshold
would make more precise classifications but identify fewer cases, while a low
threshold would identify more cases at the cost of more false positives. The user
should weigh precision against recall when making this tradeoff. For us, choosing a
threshold in the neighborhood of 0.02 seems best, since we get roughly 50% recall
while maintaining a high precision.
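A hedged sketch of applying such a threshold to held-out data, assuming a glm fit as
in the previous chapter's sketch and a test frame test with a 0/1 Delinquent column
(all names illustrative):

fit    <- glm(Delinquent ~ ., data = train, family = binomial)
scores <- predict(fit, newdata = test, type = "response")  # predicted probabilities
pred   <- as.integer(scores > 0.02)  # classify at the chosen threshold
truth  <- test$Delinquent
table(predicted = pred, actual = truth)  # confusion matrix, as in Figure 4-3
precision <- sum(pred == 1 & truth == 1) / sum(pred == 1)
recall    <- sum(pred == 1 & truth == 1) / sum(truth == 1)
enrich    <- precision / mean(truth)  # precision relative to the base delinquency rate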
We evaluated our model on the test set; the results are shown in Figure 4-3. Our
model turns out to be a low-precision classifier with a precision of 0.1020408, but it
was able to find 55.56% of the delinquent examples in the test set, at a rate 9.852608
times higher than the overall average. Next, we will see how well the model fits our
data and what we can learn from it.
Figure 4-1: The training set’s distribution of scores returned by logistic regression
by positive (delinquent, Dt = 1) and negative (non-delinquent, Dt = 0) examples
Figure 4-2: Enrichment and recall returned by logistic regression as a function of
threshold for the training set
                    actual
predicted           non-delinquent    delinquent
non-delinquent      816               4
delinquent          44                5

Figure 4-3: Confusion matrix of the resulting classifier on the test set at threshold
0.02
precision           0.1020408
recall              0.555556
enrich              9.852608
p-value             2.164682e-86
pseudo R-squared    0.2047119

Figure 4-4: Evaluation of our model
Figure 4-5: The model coefficients
4.1.2 Interpreting the Model
The coefficients in Figure 4-5 returned by logistic regression tell us the relationships
between the input variables and the output variable. Every categorical variable is
expanded into n − 1 variables, where n is the number of possible values of that
variable. Negative coefficients indicate that a variable is negatively correlated with
the probability of the event, whereas positive coefficients indicate a positive
correlation. Furthermore, we should only consider variables that are statistically
significant. As discussed in Section 3.1.2, only two of the Scoring dummy variables
are significant.
The coefficient for Gender is 0.148443151, which means that the odds of a male
customer being delinquent are exp(0.148443151) = 1.16 times those of the reference
level (females). For a female with a probability of delinquency of 5% (odds
p/(1 − p) = 0.05/0.95 = 0.0526), the odds for a male with all other characteristics
equal are 0.0526 × 1.16 = 0.061016, which corresponds to a delinquency probability
of odds/(1 + odds) = 0.061016/1.061016, or roughly 5.74%.
For numerical variables such as CREDITCOUNT, we can interpret the coefficients
similarly. The coefficient on CREDITCOUNT is -0.047937984, which means that
every credit transaction multiplies the odds of delinquency by
exp(−0.047937984) = 0.953. Starting from the male customer above with no
transactions, the odds of going delinquent for a similar customer with 10 credit
transactions would be about 0.061016 × 0.953^10 = 0.037702, which corresponds to a
delinquency probability of 3.63%.
Logistic regression models are trained by maximizing the log likelihood of the
training data, or equivalently, minimizing the residual deviance. Deviance measures
how well the model fits the data. We calculated the p-value to see how likely it is
that the reduction in deviance occurred by chance. Another interpretation of the
p-value is the probability of obtaining the results when the null hypothesis is true; a
low p-value lends significance to our results. The p-value turned out to be
2.164682e-86, so the reduction is very unlikely to have occurred by chance. We also
computed the pseudo R-squared based on deviances to see how much of the deviance
is explained by the model. The model explained 20.4% of the deviance.
We also drew the ROC curves in Figure 4-6. The ROC (receiver operating
characteristic) curve graphs the performance of the model at various threshold
values. It plots the true positive rate against the false positive rate. Notice that the
curve for the testing set is rather "steppy" because of the small fraction of delinquent
customer accounts in our data set. We have also calculated the area under the curve:
for the training set it is 0.8713662, and for the testing set it is 0.7985788.
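The curves and AUC values can be produced with, for example, the ROCR package
(an assumption; the thesis text does not name the plotting tool used):

library(ROCR)
# `scores` and `truth` as in the earlier sketch
p    <- prediction(scores, truth)
perf <- performance(p, "tpr", "fpr")  # true positive rate vs. false positive rate
plot(perf, col = "green")             # testing-set ROC curve
performance(p, "auc")@y.values[[1]]   # area under the curve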
4.2 Support Vector Machines
Support vector machines use landmarks (support vectors) to increase the power of
the model. It is a kernel-based classification approach in which a subset of the
training examples is used to represent the kernel. The kernel lifts the classification
problem into a higher-dimensional space where the data can be as linearly separable
as possible. SVMs are useful for classification when similar data are likely to belong
to the same class and when it is hard to know the interactions or combinations of
the input variables in advance. They have also been found relevant in many recent
financial applications. [ML05]
Figure 4-6: ROC curves for logistic regression. Training set is represented in red
and testing set is represented in green.
Number of Support Vectors

                     Soft margin penalty
Class weights        1        10       100      1000
0.01                 3607     3045     2668     2605
0.1                  3601     3057     2654     2606
1                    3684     3201     2771     2668
10                   6607     6071     5245     4418
100                  23832    17660    11858

Figure 4-7: A table showing the number of support vectors for different values of
soft margin penalty and class weights
4.2.1 Building the Model
There are many kernel functions we could use, but the radial basis function kernel,
or Gaussian kernel, is probably best suited to our classification problem because of
the unknown relationships among our input variables. The Gaussian kernel on two
samples $x$ and $x'$ is defined to be

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

The feature space of this kernel is infinite-dimensional [Mur12], since (taking
$\sigma = 1$) the Taylor expansion gives

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2}\right) = \exp\left(-\frac{1}{2}\|x\|^2\right)\exp\left(-\frac{1}{2}\|x'\|^2\right)\sum_{j=0}^{\infty}\frac{(x^T x')^j}{j!}$$
There are two parameters of the SVM that we can adjust to get the most accurate
model: the soft margin penalty and the class weights. The soft margin penalty lets
us trade off misclassifying training examples against getting a wider margin. The
choice depends on whether we want a complex model that applies weakly to all data
or a simpler model that applies strongly to a subset of the data; we will explore both.
The class weights let us specify the relative cost of a false positive versus a false
negative. We will also explore various values of penalization for the two.
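A hedged sketch of one cell of the grid explored in the following figures, using the
e1071 package (an assumption); train and Delinquent are as in earlier sketches, and
the class weight w expresses the cost of one class relative to the other:

library(e1071)
fit_svm <- function(C, w) {
  svm(factor(Delinquent) ~ ., data = train,
      kernel = "radial",                    # Gaussian kernel
      cost = C,                             # soft margin penalty
      class.weights = c("0" = 1, "1" = w),  # relative misclassification costs
      cross = 10)                           # 10-fold cross-validation
}
m <- fit_svm(C = 1, w = 1)
m$tot.nSV             # number of support vectors, as in Figure 4-7
100 - m$tot.accuracy  # cross-validation error in percent, as in Figure 4-9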
Training Error

                     Soft margin penalty
Class weights        1            10            100           1000
0.01                 0.01357      0.01166       0.0091        0.0065
0.1                  0.01358      0.01166       0.00917       0.0065
1                    0.0125223    0.01109853    0.008937148   0.006644984
10                   0.01845      0.01665       0.01299       0.00921
100                  0.1468711    0.1193        0.07712

Figure 4-8: A table showing the rate of training error for different values of soft
margin penalty and class weights
Cross Validation Error

                     Soft margin penalty
Class weights        1            10            100           1000
0.01                 0.01401      0.01431       0.01547       0.01827
0.1                  0.014        0.0143        0.0156        0.0183
1                    0.01353436   0.01428915    0.01590162    0.01804582
10                   0.02072      0.02425       0.02555       0.0272
100                  0.1544       0.1253        0.0879

Figure 4-9: A table showing the rate of cross-validation error for different values of
soft margin penalty and class weights
4.2.2 Evaluating the Model
We tested a range of values for the soft margin penalty C, namely 1, 10, 100, and
1000. We also tested a range of values for the ratio of the class weights W assigned
to delinquent and non-delinquent customer accounts, namely 0.01, 0.1, 1, 10, and
100.
The numbers of support vectors are shown in Figure 4-7. As seen in the table, the
number of support vectors increases as the class-weight ratio of non-delinquent to
delinquent customer accounts increases. The number decreases when we increase the
soft margin penalty. This makes sense intuitively: a larger penalty leads to a
narrower margin, so fewer training examples fall within the margin and become
support vectors.
In Figure 4-8, we show the training error for the same sets of values of the soft
margin penalty and class weights. The training error decreases as we increase the
soft margin penalty. This is expected; the soft margin penalty parameter C is a
regularization term used to control overfitting. As C increases, the optimizer prefers
to lower the training error at the cost of a smaller geometric margin; as C decreases,
it prefers a wider geometric margin at the cost of more training errors. As the ratio
of non-delinquent to delinquent weights increases, the training error also increases.
This is because the data is mostly non-delinquent: when the cost of incorrectly
predicting delinquent customers goes up, there is relatively less penalty for errors on
non-delinquent customers, so we make more training errors overall.
Figure 4-9 shows the cross-validation error. We typically want to pick the C with
the lowest cross-validation error. For class weight ratios of 0.01, 0.1, 1, and 10, a
lower soft margin penalty (C < 1) seems to produce higher accuracy, whereas for the
class weight ratio of 100, a large soft margin penalty (C > 100) would produce more
accurate models.
The precision rates are shown in Figure 4-10. Precision, also known as positive
predictive value, is defined as the fraction of predicted positive values that are
actually positive. For class weight ratios of 0.01, 0.1, and 1, precision increases with
higher values of the soft margin penalty C, while for class weight ratios of 10 and
100, precision decreases with higher values of C. This is because of the low
delinquency rate in our population: when the cost of making mistakes on delinquent
customer accounts becomes high, we want to widen our geometric margin to capture
more of the delinquent customer accounts.
Precision

                     Soft margin penalty
Class weights        1            10           100          1000
0.01                 0.0264       0.0793       0.1762       0.2159
0.1                  0.0264       0.0793       0.1762       0.2203
1                    0.1013216    0.1585903    0.2026432    0.2599119
10                   0.48017      0.4405       0.3789       0.3127
100                  0.7533       0.6122       0.489

Figure 4-10: A table showing computed precision values for different values of soft
margin penalty and class weights
Recall

                     Soft margin penalty
Class weights        1            10           100          1000
0.01                 1            0.5625       0.4706       0.405
0.1                  1            0.5625       0.4819       0.4098
1                    0.6052632    0.6428571    0.4893617    0.437037
10                   0.39636      0.3984       0.3333       0.2639
100                  0.0868       0.0825       0.0901

Figure 4-11: A table showing computed recall values for different values of soft
margin penalty and class weights
Finally, we show the recall values in Figure 4-11. Recall, also known as sensitivity,
is the fraction of positive examples that the model is able to identify. High
sensitivity can be achieved by simply labeling all examples positive, but such a
model would not be very useful. Recall also trades off against precision: a high recall
value usually implies a low precision value, and vice versa. Typically, people aim for
a precision of around 0.5, but this is a tradeoff that the user should weigh.
4.3 Random Forest
Random forest is a technique that combines decision trees and bagging. In
Section 3.1.1, we used random forest to help select which features are significant for
training our machine-learning models. In this section, we fine-tune the random forest
parameters to create an accurate model for our prediction problem. Before we go
into further details of the random forest model, we have to understand why decision
trees alone were insufficient.
4.3.1 Decision Trees
Decision trees are a simple model that makes predictions by splitting the training
data into pieces and essentially memorizing the result for each piece. Also called
classification and regression trees, or CART, they are an intuitive non-parametric
supervised learning model that produces predictions through easily interpretable
rules. The rules can be written in plain English and easily understood by human
beings; this transparency makes the models very applicable to economic and
financial applications. Furthermore, they can handle both continuous and discrete
data. To avoid clutter, we show in Figure 4-12 a small example of a decision tree
created by CART for just one variable in our dataset, OutstandingCapital.
The size of a CART tree is determined by the density parameter. The density
parameter, also known as the cost complexity factor, controls the size of the tree by
requiring the split at every node to decrease the overall lack of fit by that factor.
Figure 4-13 shows the effects of varying the density parameter on the size of the
tree, the training error, the testing error, and the running time. We started off by
trying a density of 0.01 to get a rough model to assess the running time and the size
of the tree. The construction of the tree is top-down: at each step, it chooses the
variable that best splits the set of items according to a function known as the Gini
impurity, defined below.
Figure 4-12: A decision tree to classify delinquent customer accounts by the variable
OutstandingCapital
CART Results

Density    Tree size (best)    Tree size (largest)    Training accuracy    Testing accuracy
0.1        0                   0                      0.5                  0.5
0.01       31                  31                     0.968563             0.7645995
0.001      33                  45                     0.9894557            0.779522
0.0001     45                  76                     0.9992841            0.5426357
0          48                  76                     0.9992841            0.5426357

Figure 4-13: Accuracy and run time data for CART algorithms as a function of
density and tree size
The Gini impurity measures the probability that a randomly chosen element from
the set would be mislabeled if it were randomly labeled according to the distribution
of the labels at that tree node. It is computed by summing, over the items, the
probability of an item being chosen times the probability of mislabeling it. More
formally, given a set $S$ and $m$ classes, the Gini impurity of $S$ is defined as

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{m} p_i^2,$$

where $p_i$ is the fraction of items in $S$ belonging to class $i$ and
$\sum_{i=1}^{m} p_i = 1$. [GT00] For example, a node that is 90% non-delinquent
and 10% delinquent has Gini impurity $1 - (0.9^2 + 0.1^2) = 0.18$.
Since the data set has fewer than 10 percent delinquent customer accounts, setting
the density to 0.1 didn't return any meaningful results, so we ignore the first row of
the table. As the density decreases, the model makes more accurate predictions on
the training set, which is expected since it is memorizing more sub-branches of the
trees. Testing accuracy also increases at first but decreases when the density falls
below 0.001, which is a sign of overfitting. The optimal tree size corresponds to the
lowest error achieved during the computation of the tree, which seems to be in the
45-48 range. The optimal tree is obtained through a pruning process in which sets of
sub-trees are generated by eliminating groups of branches, as determined by the cost
complexity of the CART algorithm.
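A minimal sketch of these fits with the rpart package (an assumption), where the
density parameter plays the role of rpart's complexity parameter cp:

library(rpart)
tree <- rpart(Delinquent ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.0001))  # allow a large tree
printcp(tree)  # cross-validated relative error for each candidate tree size
# prune back to the subtree with the lowest cross-validated error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)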
In Figure 4-14, we can see the relative error as a function of density and its
corresponding tree size. The relative error is defined as
$\mathrm{RSS}(k)/\mathrm{RSS}(1)$, where $\mathrm{RSS}(k)$ is the residual sum of
squares for the tree with $k$ terminal nodes. We can see that the error first
decreases and then increases as the tree size grows. For density = 0.0001, the
relative error is lowest when the tree size is 45, while it converges to its target
density when the tree size is 76.

Figure 4-14: Relative error for different tree sizes for density = 0.0001

Figure 4-15: ROC curve on testing data for densities shown in Figure 4-13. The
red line corresponds to density = 0.01 and the green curve corresponds to density =
0.001. The other densities overlap each other.
In Figure 4-15, we can see the ROC curves corresponding to the classification
performance on our testing data. As the density decreases, the ROC curve shifts
outwards, which means the accuracy of the CART model increases. However,
overfitting occurs below density = 0.001, and our model becomes less accurate on
the testing data. Generally, the farther outward an ROC curve lies, the more
accurate the classifier.
4.3.2 Building Random Forest Models
As seen in the previous section, the accuracy of a single decision tree was very low.
We could try various parameters for the CART algorithm, but there do not seem to
be any major improvements to be had. The conclusion we draw is that our data sets
are not very suitable for single decision trees, because of their tendency to overfit.
Decision trees also have high training variance: drawing different samples from the
same population can produce trees with vastly different structures and accuracy on
test sets. [ZM14] We shall turn to random forest.
We can do better than decision trees by using the technique of bagging, or bootstrap
aggregating. In bagging, random samples are drawn from the training data, and we
build a decision tree model for each set of samples. The final bagged model is the
average of the results returned by the individual decision trees. This lowers the
variance of decision trees and improves accuracy on testing data.
Random forest is an improvement over bagging with decision trees. When we do
bagging with decision trees, the trees themselves are correlated, because the
algorithm uses the same set of features each time. Random forest de-correlates the
trees by randomizing the set of variables the trees are allowed to use. In the next
section, we run random forest for various values of the number of trees and the
minimum node size, as sketched below.
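A hedged sketch of the sweep, reusing the randomForest setup from Chapter 3
(names are illustrative):

library(randomForest)
for (ntree in c(1, 10, 100, 1000)) {
  for (nodesize in c(1, 10, 100, 1000)) {
    rf <- randomForest(factor(Delinquent) ~ ., data = train,
                       ntree = ntree,        # number of bagged trees
                       nodesize = nodesize)  # minimum size of terminal nodes
    # probability scores on the held-out set; column "1" is the delinquent class
    scores <- predict(rf, newdata = test, type = "prob")[, "1"]
    # ... compute ROC and AUC from `scores` as in Section 4.1.2
    # (see Figures 4-16 through 4-21)
  }
}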
Figure 4-16: ROC curve on training data for various values of number of trees: 1
(black), 10 (red), 100 (green), and 1000 (blue)
Figure 4-17: ROC curve on testing data for various values of number of trees: 1
(black), 10 (red), 100 (green), and 1000 (blue)
Figure 4-18: ROC curve on training data for various values of minimum node size:
1 (black), 10 (red), 100 (green), and 1000 (blue)
Figure 4-19: ROC curve on testing data for various values of minimum node size:
1 (black), 10 (red), 100 (green), and 1000 (blue)
AUC Values

                     Number of trees
Minimum node size    1           10          100         1000
1                    0.9825723   0.9999996   1           1
10                   0.9807154   0.9999847   1           0.9999998
100                  0.9875644   0.9983677   0.9993586   0.9994379
1000                 0.9309181   0.9825534   0.9921611   0.9927579

Figure 4-20: AUC values on training sets for different values of number of trees and
minimum node size
AUC Values

                     Number of trees
Minimum node size    1           10          100         1000
1                    0.7708656   0.797416    0.7762274   0.8251938
10                   0.7967054   0.7167313   0.8104651   0.8014212
100                  0.6631783   0.7478036   0.7751938   0.7996124
1000                 0.6588501   0.8605943   0.8224806   0.8335917

Figure 4-21: AUC values on testing sets for different values of number of trees and
minimum node size
4.3.3 Evaluating the Model
We have chosen to test the values 1, 10, 100, and 1000 for both the number of trees
and the minimum node size. In Figure 4-16 and Figure 4-18, we show the ROC
curves for the training set; in Figure 4-17 and Figure 4-19, the ROC curves for the
testing set. AUC (Area Under the Curve) is equal to the probability that, given a
positive and a negative sample, the classifier will rank them correctly by assigning a
higher value to the positive sample and a lower value to the negative sample. Across
the graphs, the AUC for the training set rises toward 1 as the parameters that allow
more complex trees shrink, while the gap between training and testing AUC widens.
This means that there was more overfitting for small values of the parameters and
more generalization for greater values of the parameters.
In Figure 4-20, we show the AUC values on the training set for different
combinations of our values for the number of trees and the minimum node size. As
the number of trees increases, the algorithm memorizes the inputs more, resulting in
more accurate predictions on the training set. However, this can be a sign of
overfitting, as can be seen in the almost square ROC curves. As the minimum node
size increases, we get less accurate predictions on the training set, but the algorithm
overfits less and is able to generalize more.
Similarly, Figure 4-21 shows the AUC values for the testing set. The highest
accuracy is achieved with 10 trees and a minimum node size of 1000, with an AUC
value of 0.8605943. We tried a range of other values and were able to achieve an
AUC at least as high as 0.8732558.
Chapter 5
Closing Remarks
In this chapter, we end the paper with a discussion of the results from Chapter 4
and a recommendation of classification techniques for the Bank to explore. Finally,
we point out areas in which further research could be done.
5.1 Conclusion
In Chapter 4, we explored three machine-learning algorithms commonly used in
building models for consumer credit risk. We assessed the quality of the methods on
our training and testing sets with ROC curves and AUC values. By this metric,
random forest seems to perform best on our data sets. This is consistent with results
from other machine-learning literature that compares different techniques.
[LRMM13] Random forest proves to be one of the most predictive models for our
classification problem. We thus recommend that the Bank look into further tuning
the parameters of the random forest algorithm, or building a customized version to
fit its specific needs. Random forest is not able to handle features with missing
values; if the Bank chooses to include variables with missing values in the set of
features used by random forest, additional work would be needed to handle this.
5.2 Further Work
This research has a few areas that can be further developed. One obvious direction
is to try other machine-learning algorithms, such as neural networks, which have
proved effective for credit rating. [HCH+04] Another is to further fine-tune the
existing parameters to achieve more accurate models.
We have used ROC curves and AUC values to assess the quality of our models.
Although AUC values are commonly used in the machine-learning literature to
compare models, recent research calls into question the reliability and validity of
AUC estimates, both because of the noise it can introduce as a classification
measure and because reducing performance to a single number ignores the tradeoffs
the curve represents. Other metrics we should look at in the next steps of this
research include Informedness and DeltaP.
Bibliography

[FS14] D. P. Foster and R. A. Stine. Variable selection in data mining: Building a
predictive model for bankruptcy, 2014.

[GT00] J. Galindo and P. Tamayo. Credit risk assessment using statistical and
machine learning: Basic methodology and risk modeling applications. Computational
Economics, 15:107-143, 2000.

[HCH+04] Z. Huang, H. Chen, C. J. Hsu, W. H. Chen, and S. Wu. Credit rating
analysis with support vector machines and neural networks: a market comparative
study. Decision Support Systems, 37:543-558, 2004.

[Ken13] Kenneth Kennedy. Credit Scoring Using Machine Learning. PhD thesis,
Dublin Institute of Technology, 2013.

[KKL10] A. E. Khandani, A. J. Kim, and A. W. Lo. Consumer credit-risk models
via machine-learning algorithms. Journal of Banking and Finance, 34:2767-2787,
2010.

[LRMM13] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable
classifiers using rules and Bayesian analysis: Building a better stroke prediction
model, 2013.

[ML05] J. H. Min and Y. C. Lee. Bankruptcy prediction using support vector
machine with optimal choice of kernel function parameters. Expert Systems with
Applications, 28:603-614, 2005.

[Mur12] K. P. Murphy. Machine Learning: a Probabilistic Perspective. MIT Press,
Cambridge, MA, 2012.

[ZM14] N. Zumel and J. Mount. Practical Data Science with R. Manning, Shelter
Island, NY, 2014.