RECOMMENDATIONS FROM KAGGLE CHALLENGE
By: Hyacinth Ampadu

• Most individuals (over 50%) are working, and only 0.30% of them have bad loans.
• Pensioners account for about 17% of individuals, and 0.48% of them have bad loans.
• Students, who make up less than 1% of individuals, have no bad loans.
• 90% of individuals live in a house/apartment, and only 0.35% of them have bad loans (the second-lowest rate).
• Co-op apartments account for only 0.47% of all housing types, but have the highest rate of bad loans, with 0.82% of individuals in co-op apartments having bad loans.
• Married individuals account for over 70% of the entire population, and only 0.36% of them have bad loans.
• Widows make up only 4% of individuals, but 0.48% of them have bad loans.
• Though the number of bad loans is small, married individuals should be targeted more as they tend to honor their loans.

Recommendations from Analysis
• Working individuals could be targeted more, since they have a low probability of having bad loans.
• Individuals in co-op apartments are the most likely to have bad loans, even though they are the fewest, so care has to be taken with such individuals.
• Married individuals should be targeted more, while widows should be studied more, as they tend to have bad loans even though they are not that many.

Predictive Modelling – Choice of Algorithms
• There should be a trade-off between accuracy and interpretability, so that the decision to reject an application is understood by all while still being made with a high degree of accuracy.
• Logistic regression is the most interpretable classification model available, but it is not the highest performing.
• Neural networks are very high-performing models but act like a black box, with little interpretability.
• Ensemble models such as random forests have the best of both worlds: they are very high-performing and retain some degree of interpretability (though not as much as logistic regression).
• For these reasons, logistic regression and random forest are the models used here.

Data Imbalance
• The data is highly imbalanced, with less than 1% of individuals having a bad loan.
• Three approaches were used: first the models were trained with the imbalance left in place, then the models' class weights were adjusted to account for the imbalance, and finally the data itself was resampled to correct the imbalance.
• The results achieved are shown in the following sections.

Results – Imbalanced dataset: Logistic Regression
• Training on the imbalanced dataset, the confusion matrix shows that for the class we actually care about, the bad loans (labelled 0 here), none of the individuals were predicted by the logistic regression model.
• Even so, the model had an accuracy of about 99.6%.
• This suggests that accuracy is not a good metric for evaluating imbalanced datasets, and the imbalance needs to be addressed to capture the individuals with bad loans.
• The recall is 0 for the logistic regression model (a minimal sketch of this evaluation follows below).
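The following is a minimal sketch, assuming scikit-learn, of the kind of evaluation described above. The data here is synthetic (make_classification with a similar <1% bad-loan rate) rather than the original credit dataset, so the exact numbers will differ; the point is to show accuracy printed side by side with recall on the bad-loan class.

```python
# Sketch only, not the original notebook: synthetic stand-in for the credit data,
# with bad loans labelled 0 as in the slides.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data: ~0.4% of samples belong to class 0 (bad loan).
X, y = make_classification(
    n_samples=50_000, n_features=10, weights=[0.004, 0.996],
    class_sep=0.5, flip_y=0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# On the real data the slides report ~99.6% accuracy but a recall of 0 on bad loans;
# printing both metrics together makes that failure mode visible.
print("accuracy:        ", accuracy_score(y_test, pred))
print("bad-loan recall: ", recall_score(y_test, pred, pos_label=0))
print(confusion_matrix(y_test, pred))
```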
Results – Imbalanced dataset: Random Forest
• The random forest predicted only 21 individuals with bad loans.
• It had 545 false negatives, i.e. individuals who had bad loans but were predicted to have good loans.
• These are what we are most interested in minimizing, so that we catch as many of the individuals with bad loans as possible; hence recall is the key metric.
• As can be seen, the recall is 0.04 for the random forest.

Results – Class weight to mitigate imbalance: Logistic Regression
• Here, the model's class weight is used to address the imbalance.
• Originally both classes had a weight of 1; with balanced class weights, the two classes contribute equally to the model, which partly compensates for the imbalance.
• The logistic regression's recall improved from 0 to 0.55, identifying 311 individuals with bad loans and reducing the false negatives from 566 to 255, which is significant.

Results – Class weight to mitigate imbalance: Random Forest
• The random forest's recall improved from 0.04 to 0.87, identifying 494 individuals with bad loans and reducing the false negatives from 545 to 72, which is significant.
• One thing to note: reducing false negatives tends to increase false positives, which raises recall but lowers precision, so care has to be taken with the extent of the reduction.
• There are now many people who do not have bad loans but whom the model predicts to have bad loans, causing delays and time wasted making all the checks on such people.
• Precision dropped from 0.70 to 0.03, which on its own is not good.

Results – Sampling to mitigate imbalance: Logistic Regression
• The data was resampled using SMOTE, increasing the minority class (bad loans) to 70% of the number of majority samples by generating synthetic minority samples.
• The logistic regression's recall here was 0.2, identifying 115 individuals with bad loans.
• The model had 451 false negatives.

Results – Sampling to mitigate imbalance: Random Forest
• The random forest's recall here was 0.82, identifying 463 individuals with bad loans.
• The model had 103 false negatives.

RESULTS SUMMARY

Model                 Imbalance strategy       Recall   Bad loans identified
Logistic Regression   None                     0.00     0
Random Forest         None                     0.04     21
Logistic Regression   Balanced class weights   0.55     311
Random Forest         Balanced class weights   0.87     494
Logistic Regression   SMOTE resampling         0.20     115
Random Forest         SMOTE resampling         0.82     463

INTERPRETABILITY – Random Forest
• The income an individual makes is by far the most important feature in determining whether they will have a good or bad loan.
• The number of children is the least important.

INTERPRETABILITY – Logistic Regression
• The logistic regression model considers the number of children to have the most impact on predicting loans, the opposite of the random forest, which identified it as the least important.
• An increase in the number of children (cnt_children) increases the odds of a good loan vs. a bad loan by a factor of 4.49, when all other features remain the same.
• If the amt_income_total variable increases by one unit, the odds of a good loan decrease by 7.5 percent.
• For individuals who own a car (flag_own_car), the odds of a good loan vs. a bad loan are lower by a factor of 0.93, compared to individuals without their own car, given all other features stay the same (a sketch of how such odds ratios are read off the model follows below).
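As a sketch of how the class-weight mitigation and the interpretability figures above can be obtained, assuming scikit-learn, NumPy and pandas: the dataframe below is a small synthetic stand-in with the feature names used in the slides, so the printed values are illustrative, not the 4.49 or 0.93 figures reported above.

```python
# Sketch only: synthetic stand-in data; 1 = good loan, 0 = bad loan as in the slides.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "cnt_children": rng.integers(0, 4, n),
    "amt_income_total": rng.normal(18.0, 5.0, n),  # income in 10k units, for numerical stability
    "flag_own_car": rng.integers(0, 2, n),
})
df["good_loan"] = (rng.random(n) > 0.004).astype(int)  # <1% bad loans, mimicking the imbalance

X, y = df[["cnt_children", "amt_income_total", "flag_own_car"]], df["good_loan"]

# class_weight="balanced" reweights each class inversely to its frequency,
# the same mitigation used in the "class weight" results above.
logreg = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of a good loan for a
# one-unit increase in that feature, holding the other features fixed -- this is
# how statements like "by a factor of 4.49" are read off the model.
print(pd.Series(np.exp(logreg.coef_[0]), index=X.columns))

# Random forest feature importances, for comparison with the other interpretability slide.
rf = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns))
```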
DISCUSSION
• The random forest with balanced class weights gave the best result and the best chance of identifying individuals with bad loans, so the bank can identify such individuals and take the necessary actions.
• The resampling only generates synthetic variants of existing minority samples, so the model is not learning genuinely new information but more of the same, which makes that particular technique a bit concerning to rely on.
• In reducing false negatives (improving recall), precision drops sharply; although precision is not our main metric, it is still important to limit the number of individuals who have good loans but are predicted to have bad loans, to avoid wasting time checking them.
• Another metric to use is the F1 score, the harmonic mean of recall and precision, so that we do not optimize for recall alone while missing out on good loans.

RECOMMENDATIONS BASED ON PREDICTIVE MODELS
• The random forest model (balanced with class weights) should be used to identify bad loans, but it needs a lot of work before it can be used in practice.
• If interpretability is a higher priority than accuracy (which could be the case here), logistic regression should be used, since we can easily explain why an individual would be a bad loan, though it does not achieve as high an accuracy.

What I could have done if I had more time
• More feature engineering, especially on the date variables.
• Hyperparameter tuning, both for the model hyperparameters and for identifying the best class weights (a sketch of such a search follows below).
• A deeper investigation of the occupation field, which had nulls, so it can be used in the prediction.
• Correlation analysis, multicollinearity checks, and other statistical tests.
• A more in-depth look at the imbalance.
• A look at other models such as XGBoost and LightGBM.
• Better construction of the labels based on more in-depth logic.
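The class-weight tuning listed above might start from something like the sketch below, which also uses the F1 score mentioned in the discussion as the selection metric. It assumes scikit-learn; the synthetic data, weight grid, and scorer are illustrative choices, not part of the original work.

```python
# Sketch of a grid search over minority-class weights, scored with F1 on the
# bad-loan class (label 0) so recall improves without precision collapsing entirely.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.004, 0.996],
                           class_sep=0.5, flip_y=0, random_state=42)

bad_loan_f1 = make_scorer(f1_score, pos_label=0)  # F1 computed on the bad-loan class
param_grid = {"class_weight": [{0: w, 1: 1} for w in (1, 10, 50, 100, 250)]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    scoring=bad_loan_f1,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```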