RECOMMENDATIONS FROM
KAGGLE CHALLENGE
By: Hyacinth Ampadu
• Most individuals are working (over 50%), and only 0.30% of them have bad loans
• In comparison, pensioners account for about 17% of the individuals, and 0.48% of them have bad loans
• Students, who make up less than 1%, have no bad loans
• 90% of individuals have a house/apartment, and only 0.35% of them (the 2nd lowest rate) have bad loans
• Co-op apartments accounted for only 0.47% of all housing types, but had the highest rate of bad loans, with 0.82% of individuals with co-op apartments having bad loans
• Married individuals accounted for over 70% of the entire population, and only 0.36% of them had bad loans
• Widows, by comparison, were only 4% of the population, but 0.48% of them had bad loans
• Though the number of bad loans is small, married individuals should be targeted more, as they tend to honor their loans
Recommendations from Analysis
• Working individuals could be targeted more, since they have a low probability of having bad loans
• Individuals in a co-op apartment are the most likely to have bad loans, even though they are the smallest group, so care has to be taken with such individuals
• Married individuals should be targeted more, while widows should be studied more, as they tend to have bad loans even though they are not that many
Predictive Modelling
Choice of Algorithms
• There should be a tradeoff between accuracy and interpretability, so that the decision to reject an application is understood by all, while rejections are still made with a high degree of accuracy
• Logistic regression is the most interpretable model available for classification, but it is not that high-performing
• Neural networks are very high-performing models, but they are a black box, with little to no interpretability
• Ensemble models such as random forest have the best of both worlds: they are very high-performing and retain some degree of interpretability (though not as much as logistic regression)
• For these reasons, logistic regression and random forest are the models used here, as sketched below
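A minimal sketch of these two choices, assuming scikit-learn (the hyperparameters shown are illustrative, not the settings from the original notebook):

```python
# Minimal sketch of the two chosen models (scikit-learn assumed).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Interpretable baseline: each coefficient maps directly to a feature's effect on the odds.
log_reg = LogisticRegression(max_iter=1000)

# Higher-performing ensemble; feature_importances_ gives partial interpretability.
rand_forest = RandomForestClassifier(n_estimators=100, random_state=42)
```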
Data Imbalance
• The data is highly imbalanced, with less than 1% of the individuals having a bad loan (verified in the sketch below)
• Three approaches were used: first, the models were trained with the imbalance in place; second, the models' class weights were adjusted to account for the imbalance; third, the data itself was resampled to correct the imbalance
• The results achieved are demonstrated in the following sections
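As a quick illustration of checking that imbalance (synthetic labels stand in for the real data; the deck codes bad loans as 0):

```python
# Minimal sketch: verify the class imbalance before training.
# Synthetic labels stand in for the loan data; the deck codes bad loans as 0.
import pandas as pd

labels = pd.Series([0] * 60 + [1] * 9940)   # ~0.6% bad loans, illustrative
print(labels.value_counts(normalize=True))  # class 0 is well under 1%
```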
Results – Imbalanced dataset
Logistic Regression
• Training with the imbalanced dataset, the confusion matrix shows that for what we actually care about, the bad loans (signified as 0 here), the logistic regression model predicted none of the individuals
• This is despite the model having an accuracy of about 99.6%
• This suggests that accuracy is not a good metric for evaluating imbalanced datasets, and that we need to fix the imbalance to capture the individuals with bad loans (illustrated in the sketch below)
• The recall for logistic regression is 0
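A minimal sketch of this failure mode on synthetic stand-in data (the printed numbers are illustrative, not the deck's figures):

```python
# Minimal sketch: accuracy misleads on imbalanced data.
# Synthetic data stands in for the loan dataset; class 0 (bad loans) is the rare class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.005, 0.995],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)

print(accuracy_score(y_te, pred))             # high even if the rare class is missed
print(confusion_matrix(y_te, pred))           # shows how few bad loans are caught
print(recall_score(y_te, pred, pos_label=0))  # the metric that actually matters here
```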
Results – Imbalanced dataset
Random Forest
• The random forest predicted only 21 individuals with bad loans
• It had 545 false negatives, i.e. individuals who had bad loans but were predicted to have good loans
• The false negatives are what we are most interested in, because we want to minimize them so that we catch as many of the individuals with bad loans as possible; hence recall is the key metric
• As can be seen, the recall for the random forest is 0.04
Results – Class weight to mitigate imbalance
Logistic Regression
• Here, the model's class weights are used to fix the imbalance
• Originally, both classes had a weight of 1; setting the class weight to "balanced" makes the two classes contribute equally to the loss, which partially corrects the imbalance (see the sketch below)
• The logistic regression's recall improved from 0 to 0.55, identifying 311 individuals with bad loans and reducing the false negatives from 566 to 255, which is significant
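A minimal sketch of the class-weight adjustment, assuming scikit-learn:

```python
# Minimal sketch: mitigating imbalance via class weights (scikit-learn).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare bad-loan class contributes as much to the loss as the majority.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
rand_forest = RandomForestClassifier(class_weight="balanced", random_state=42)
```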
Results – Class weight to mitigate imbalance
Random Forest
• The random forest's recall improved from 0.04 to 0.87, identifying 494 individuals with bad loans and reducing the false negatives from 545 to 72, which is significant
• One thing to note: reducing false negatives increases false positives, which raises recall but lowers precision, so care has to be taken with the extent of the reduction
• There are now many people who don't have bad loans but whom the model predicts to have bad loans, hence delays and more time wasted making all the checks on such people
• Precision dropped from 0.70 to 0.03, which in itself is not good
Results – Sampling to mitigate imbalance
Logistic Regression
• The data was resampled using SMOTE, increasing the minority class (bad loans) to 70% of the number of majority samples by synthesizing new minority samples from the existing ones (see the sketch below)
• The logistic regression's recall here was 0.2, identifying 115 individuals with bad loans
• The model had 451 false negatives
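A minimal sketch of the resampling step, assuming the imbalanced-learn library and synthetic stand-in data:

```python
# Minimal sketch: SMOTE oversampling with imbalanced-learn.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the loan data; class 0 (bad loans) is the rare class.
X_train, y_train = make_classification(n_samples=10000, weights=[0.01, 0.99],
                                       flip_y=0, random_state=0)

# sampling_strategy=0.7: the minority class is grown to 70% of the majority count
# by interpolating between existing minority samples.
smote = SMOTE(sampling_strategy=0.7, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
```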
Results – Sampling to mitigate imbalance
Random Forest
• The random forest's recall here was 0.82, identifying 463 individuals with bad loans
• The model had 103 false negatives
RESULTS SUMMARY

MODEL                 | IMBALANCE STRATEGY      | RESULTS (RECALL) | BAD LOANS IDENTIFIED
Logistic Regression   | None                    | 0                | 0
Random Forest         | None                    | 0.04             | 21
Logistic Regression   | Balanced class weights  | 0.55             | 311
Random Forest         | Balanced class weights  | 0.87             | 494
Logistic Regression   | SMOTE resampling        | 0.2              | 115
Random Forest         | SMOTE resampling        | 0.82             | 463
INTERPRETABILITY
Random Forest
• The income an individual makes is by far the most important feature in determining whether they would be a good or bad loan (see the sketch below)
• The number of children is the least important
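A minimal sketch of how such a ranking is read off a fitted forest; the data is synthetic, so the printed order is illustrative, while the column names follow the deck's features:

```python
# Minimal sketch: ranking features by random-forest importance.
# Synthetic data; column names follow the deck, but the resulting order is illustrative.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["amt_income_total", "flag_own_car", "cnt_children"])

forest = RandomForestClassifier(random_state=42).fit(X, y)
print(pd.Series(forest.feature_importances_, index=X.columns)
      .sort_values(ascending=False))
```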
INTERPRETABILITY
Logistic Regression
The logistic regression model considers the number of children as having the most impact on predicting loans, which is the opposite of the random forest, which identified it as the least important.
An increase in the number of children (cnt_children) changes (increases) the odds of a good loan vs. a bad loan by a factor of 4.49, when all other features remain the same.
If the amt_income_total variable increases by one, the odds of a good loan reduce by 7.5 percent.
For individuals having their own car (flag_own_car), the odds of a good loan vs. a bad loan are lower by a factor of 0.93, compared to individuals without their own car, given all other features stay the same.
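These figures are odds ratios, i.e. exponentiated logistic-regression coefficients (an odds ratio of 0.925 corresponds to the 7.5 percent reduction quoted above). A minimal sketch on synthetic data, so the printed values will not match the deck's:

```python
# Minimal sketch: odds ratios from logistic-regression coefficients.
# Synthetic data; the printed values are illustrative, not the deck's figures.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["cnt_children", "amt_income_total", "flag_own_car"])

log_reg = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coef) is the multiplicative change in the odds per one-unit feature increase,
# all other features held constant (e.g. exp(coef) = 0.925 means a 7.5% reduction).
print(pd.Series(np.exp(log_reg.coef_[0]), index=X.columns))
```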
DISCUSSION
• The random forest with balanced class weights gave us the best result and the best chance to identify individuals with bad loans, so the bank can flag such individuals and take the necessary actions
• The SMOTE resampling synthesizes minority samples from the existing ones, so the model isn't learning genuinely new information, just more of the same, which makes that particular technique a somewhat alarming one to rely on
• In reducing false negatives (raising recall), the precision is greatly reduced; even though precision is not our main metric, it is still necessary to limit the number of individuals who have good loans but are predicted to have bad loans, to avoid wasting time checking these individuals
• Another metric to use is the F1 score, which is the harmonic mean of recall and precision, so we don't just optimize for recall but also don't miss out on good loans (see the sketch below)
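A minimal sketch of the F1 computation on a tiny hypothetical label set (pos_label=0 because the deck codes bad loans as 0):

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # tiny hypothetical labels; bad loans are 0
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

p = precision_score(y_true, y_pred, pos_label=0)  # 1.0
r = recall_score(y_true, y_pred, pos_label=0)     # 0.667: one bad loan missed
print(f1_score(y_true, y_pred, pos_label=0))      # 0.8 = 2*p*r / (p + r)
```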
RECOMMENDATIONS BASED ON PREDICTIVE MODELS
• The random forest model (balanced with class weights) should be used to identify bad loans, but it needs a lot of work before it can be used in practice
• If interpretability is more of a priority than accuracy (which could be the case here), logistic regression should be used, since we can easily explain why an individual would be a bad loan, though its accuracy is not as high
What I could have done if I had more time
• More feature engineering, especially on the date variables
• Hyperparameter tuning, both for the model hyperparameters and for identifying the best class weights
• A deeper investigation of the occupation field, which had nulls, so it can be used in the prediction
• Correlation analysis, multicollinearity checks, and other statistical tests
• A more in-depth look at the imbalance
• A look at other models such as XGBoost and LightGBM
• Better construction of the labels based on more in-depth logic