Loan Predictor Using Machine Learning for Banks
Amit Rane
Computer Science and Engineering
Lovely Professional University
Phagwara, Punjab, India
amitrane1545@gmail.com
Shivam Kumar
Computer Science and Engineering
Lovely Professional University
Phagwara, Punjab, India
shivkc12@gmail.com
Sagar Dipak Vaidya
Computer Science and Engineering
Lovely Professional University
Phagwara, Punjab, India
sagarvaidya764@gmail.com
Abstract— Even the hard-hitting global pandemic, COVID-19, could not stop the financial world from growing. The economies of all countries, from developed to developing, have been growing at a very fast rate, so the availability of credit plays a crucial role in development. At the same time, credit can create a critically brittle situation for both banks and borrowers. Past experience shows that the easy availability of credit can be a boon, but when it is used carelessly and leaves behind a collection of bad debts, it can be a bane.
Index Terms-- ML-Machine Learning, AI-Artificial Intelligence, CNN-Convolutional Neural Network, IoT-Internet of Things, SVM-Support Vector Machine
1. Introduction
With respect to the world's greatest economic principles, the most important requirement for a suitable and efficient economy is the "efficient allocation of scarce resources." In a world that is growing every second, money is the most consequential of these resources. In the current world of finance, "credit" can be defined as the eligibility to borrow money from a lender. Credit is a crucial aspect for most countries precisely because it is not easily available: good access to credit helps developing countries expand and develop when money is scarce. At the same time, large amounts of bad debt have proved that when credit is easy to access, it can be used inappropriately and can end up making someone bankrupt.
2. Literature Survey
The current published papers cover a wide range of methods, from employing a single algorithm, such as Logistic Regression, XGBoost, or Decision Tree, to approaches that integrate many algorithms.
Because of its simplicity, a single algorithm cannot be used in real life over big datasets with various deciding factors (for example, name, age, gender, and spousal income); in a number of published papers it is used only for exploratory data analysis. To improve the decision-making system, a mix of decision trees and random forests is utilized after the initial step of EDA (Exploratory Data Analysis) to determine the various nodes and choose the determining criteria for loan approval. Feature selection is one of the most important components of the algorithm: only selecting the proper characteristics (age, gender, income, spousal income, credit score) will provide the right outcome. As a result, complete preprocessing of the dataset with suitable feature selection can considerably improve the model and its accuracy.
To attain maximum practicality, the algorithm must not be too simple, as this would damage the model's correctness, and it must also not be too complex, as this would impede its utility and, most importantly, its speed. To achieve our goal, we looked at a number of papers written by financial and fintech specialists and came to the conclusion that a combination of logistic regression, XGBoost, decision tree, and random forest analysis delivers a very high level of accuracy with no lag in the system's performance.
3. Methodology
A. Loading the Data and Preparing the Environment
Python is the language used to build the machine learning models in this project. Libraries such as pandas, seaborn, and scikit-learn are used to build the models. The whole dataset is divided into two files, i.e., a train dataset and a test dataset. The train dataset contains both the independent variables and the target variable, while the test dataset contains only the independent variables; the model is asked to predict the target variable for the test data. We also make copies of the training and test datasets so that the original data is not lost if we want to make changes to them.
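A minimal sketch of this step, assuming the two files are named train.csv and test.csv (both file names are assumptions made for illustration):

```python
import pandas as pd

# Load the two files: the train set carries the target, the test set does not.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Keep untouched copies so the originals survive any later changes.
train_original = train.copy()
test_original = test.copy()
```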
B. Data Comprehension
The dataset used in this project consists of 12 independent variables and one target variable. The independent variables include gender, married, credit history, education, etc. Further, these variables are classified into three different categories (a grouping sketch follows the list):
1. Categorical variables: variables whose values fall into distinct categories, such as gender or married.
2. Ordinal variables: categorical variables whose categories follow some order or series, such as education level.
3. Numeric variables: variables that have numerical values and don't carry any order or series, such as income.
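Continuing from the loading sketch above, the three groups can be recorded up front so later steps can treat them differently; the column names here are assumptions based on the variables named above:

```python
# Group the columns by variable type (column names assumed for illustration).
categorical = ["Gender", "Married", "Self_Employed", "Credit_History"]
ordinal = ["Dependents", "Education", "Property_Area"]
numeric = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]

# A quick sanity check that every column is present and typed as expected.
print(train[categorical + ordinal + numeric].dtypes)
```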
C. Bivariate Analysis
The analysis is achieved by comparing the target variable against each independent variable. Graphs are used to demonstrate the overlap on two-dimensional plots. An applicant with a high salary has a greater probability of being sanctioned a loan, and applicants who have repaid a previous loan have a better chance of getting a new one. After the loan is sanctioned, the amount is transferred to the applicant's bank account. Loan acceptance is more likely when the applied loan amount is small, so applicants often choose a smaller EMI to repay in order to raise their chances of approval.
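One way such a comparison can be plotted is a stacked proportion bar chart of loan status per category; a sketch, with the column names Credit_History and Loan_Status assumed:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Share of approved vs. rejected loans for each credit-history value
# (column names are assumed for illustration).
ct = pd.crosstab(train["Credit_History"], train["Loan_Status"])
ct.div(ct.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
plt.ylabel("Proportion of applicants")
plt.show()
```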
D. Missing Value Treatment
When applicants submit their information, they sometimes omit details such as gender, marital status, dependents, self-employment status, loan amount term, loan amount, and credit history. To fill in these missing values, we impute them: for numerical variables we use the mean or median, and for categorical variables we use the mode.
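A sketch of this imputation, with column names assumed as before:

```python
# Mode for categorical columns, median for numeric ones
# (column names assumed for illustration).
for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
    train[col] = train[col].fillna(train[col].mode()[0])

for col in ["LoanAmount", "Loan_Amount_Term"]:
    train[col] = train[col].fillna(train[col].median())
```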
E. Outlier Treatment
When we work with datasets, some input values are not practically possible. If we do not eliminate these outliers during univariate analysis, they will distort the mean and standard deviation of the data, so we use outlier treatment to remove the unwanted values. Outliers skew the distribution, and we have used a log transformation to eliminate this skewness from the data. The main benefit of the log transformation compared to other methods is that it has very little impact on smaller values but a huge impact on larger values. Thus, by using the log transformation, we get a new distribution that is roughly equal to the normal distribution.
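A minimal sketch of the transformation, assuming the skewed column is LoanAmount:

```python
import numpy as np

# log1p = log(1 + x): compresses large values, barely moves small ones,
# and stays defined at zero. Column name assumed for illustration.
train["LoanAmount_log"] = np.log1p(train["LoanAmount"])
test["LoanAmount_log"] = np.log1p(test["LoanAmount"])
```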
F. Model Building I
Logistic regression is used to predict the target variable in this first model. The algorithm classifies on the basis of the independent variables and predicts the output in binary format, i.e., 1 or 0. Since the input must also be accepted in binary form, all categorical variables are first converted to 0/1 values. The algorithm evaluates a logistic function, which models the log-odds of the event. First, we train the model on the train dataset so that we can predict the test dataset. For validation, the train dataset is divided into two parts, a training part and a validation part; the model fitted on the training part is used to predict the validation part.
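A sketch of this first model, assuming the target column is named Loan_Status:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Convert categoricals to 0/1 dummies and split off a validation part.
X = pd.get_dummies(train.drop("Loan_Status", axis=1))
y = train["Loan_Status"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```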
G. Cross-Validation of Logistic Regression Using Stratified k-Folds
Cross-validation is used to test the model against unknown data: a small part of the dataset is reserved so that the model is not trained on it, and this sample is used to test the model just before finalizing it. When we stratify the data, every fold contains the classes of the target in roughly the same proportion as the whole dataset, so each fold represents the whole dataset in an organized form. This process is called "stratified k-fold cross-validation." It gives efficient validation with minimal effort.
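A sketch of five stratified folds, reusing the X and y built in the previous sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Each fold keeps the approved/rejected ratio of the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, val_idx in skf.split(X, y):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(accuracy_score(y.iloc[val_idx], fold_model.predict(X.iloc[val_idx])))
print("mean accuracy over folds:", np.mean(scores))
```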
H. Feature Engineering
Based on our domain expertise, we generate additional features on which the target variable may depend. These are the three new features we intend to develop (a sketch of all three follows the list):

Total Income: As indicated in the bivariate analysis of our project, the incomes of the applicant and the co-applicant carry similar importance, so we combine them into one feature. A high combined income gives a higher probability of the loan being sanctioned.

EMI: The EMI is the monthly instalment the applicant pays to repay the debt. The motive for using this variable is that applicants who must pay a high EMI may find it difficult to repay the loan. The EMI is calculated as the amount of the loan divided by the length of the loan tenure.

Balance Income: The income left after the EMI has been paid is known as "balance income." We introduce this feature because when the balance income is high, the odds of the person repaying the loan are high, which enhances the chances of getting the loan approved.
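A sketch of all three features, with column names assumed and with the common convention that LoanAmount is recorded in thousands (also an assumption):

```python
# Engineered features (column names assumed for illustration).
for df in (train, test):
    df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]
    # Multiply EMI by 1000 only if LoanAmount is in thousands (assumed here).
    df["Balance_Income"] = df["Total_Income"] - df["EMI"] * 1000
```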
I. Model Building II
After performing feature engineering and adding these extra features, we continue with building Model II using various algorithms: logistic regression, decision tree, random forest, and XGBoost. These algorithms are described in the next section.
4. Algorithms Used
A. Logistic Regression
In machine learning, logistic regression is used as a classification algorithm. All the independent variables in the dataset are used to train the model to predict the target variable. Logistic regression predicts the outputs of a categorical dependent variable, so all the categorical variables should be converted into binary format prior to executing the model. The predicted outputs are in the format 0 or 1, i.e., No or Yes.
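At the core of the algorithm is the logistic (sigmoid) function, which maps a linear score to a probability; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear score w.x + b into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5 -> exactly on the decision boundary
print(sigmoid(3.0))  # ~0.95 -> predicted class 1
```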
B. Decision Tree
A decision tree is a powerful machine learning algorithm used for classification and prediction. The decision tree works according to predefined rules by which a node is split into two or more groups. This splitting of the data depends entirely on the entropy of the system. Using this algorithm, we decide the root node and obtain the various outputs for the given data; the terminal outputs obtained with this algorithm are known as leaf nodes.
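A sketch of an entropy-based tree on the split built in the Model Building I sketch; the depth limit is an illustrative assumption:

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" makes each split maximize information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=1)
tree.fit(X_train, y_train)
print("validation accuracy:", tree.score(X_val, y_val))
```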
C. Random Forest
After performing the decision tree algorithm, we perform random forest as an extension of it. A number of decision trees, i.e., weak learners, are combined so that we can predict the target accurately. For every learner, the algorithm draws a random sample of rows and variables and fits a decision tree on it; the individual guesses produced by the learners are aggregated into the concluding result. The random forest algorithm predicts very steady and accurate output, and we tune it by implementing grid search.
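A sketch of the grid search over the forest; the parameter grid itself is an assumption made for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; cv=5 matches the five folds used elsewhere.
param_grid = {"n_estimators": [50, 100, 150], "max_depth": [3, 5, 7]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
print("validation accuracy:", grid.score(X_val, y_val))
```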
D. XGBoost
XGBoost is an implementation of gradient-boosted decision trees. XGBoost sets a target outcome for the model so that it can minimize the error; for every case, the target outcome is based on the gradient of the error. XGBoost is scalable and very accurate. While training the model, XGBoost adds new trees that fit the errors of the prior trees and combines them to make the final prediction. When a new model is added during gradient boosting, it uses the gradient descent algorithm to lower the loss.
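A sketch using the xgboost scikit-learn wrapper; the "Y"/"N" coding of the target is an assumption:

```python
from xgboost import XGBClassifier

# XGBClassifier expects a 0/1 target; "Y"/"N" labels are assumed here.
y_train01 = (y_train == "Y").astype(int)
y_val01 = (y_val == "Y").astype(int)

xgb = XGBClassifier(n_estimators=50, max_depth=4, learning_rate=0.1)
xgb.fit(X_train, y_train01)
print("validation accuracy:", xgb.score(X_val, y_val01))
```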
5. Results
A. Bivariate analysis
For the analysis, we compare the target variable against the independent variables. To gain a better understanding, the percentages of loans granted and rejected for married and unmarried applicants are compared, as is the approval rate for applicants with a co-applicant. From these, we compare each category of an independent variable against the loan sanction rate. According to the data, a person with a credit history of 1 has a better chance of getting a loan.
B. Outlier Treatment
Outliers have a high impact on the normal distribution of the data, so it is important to remove all the outliers from the given data. If outliers are present in the data, they distort the mean, median, and standard deviation, which leads to wrong results and slows down the model's decision-making. We used the log transformation to eliminate the skewness from the data. Its main benefit compared to other methods is that it has very little impact on smaller values but a huge impact on larger values. Thus, we come up with a distribution similar to the normal distribution.
C. Feature Engineering
The distribution of the Total Income feature is examined.
The EMI feature is built and its distribution is tested.
Similarly, the Balance Income feature is developed and its distribution is examined.
D. Results
All the models in our project are trained and validated using the exploratory data analysis and feature engineering techniques described above. The stratified k-folds method is used for validation, and the mean accuracy across all folds is reported as the model's accuracy. To obtain optimal accuracy in a short period of time, we used a total of five folds. We conclude that random forest with grid search provides optimal accuracy in the shortest timeframe.
6. Acknowledgement
We are thankful to a number of individuals who contributed towards our final-year project and without whose help it would not have been possible. Firstly, we offer our sincere thanks to our project guide, Prof. Shivali Chopra, for her constant and timely help and guidance throughout our preparation. We are grateful to all the project guides for their valuable inputs to our project. We are also grateful to the college authorities and the entire faculty for their support in providing us with the facilities required throughout the semester.
7. Conclusion and Future Scope
In this research paper, we have presented the methods, algorithms, and feature-selection models used to build a system that can precisely select eligible loan applicants, reducing the amount of bad debt and the losses that banks suffer because of defaulted loans. This system was introduced so that we can minimize bank losses and human errors with the help of the latest machine learning techniques. By preventing these errors and reducing human intervention, we can spare banks huge losses and help stabilize them.
8. References
[1] Aditya Sarkar, Karedla Krishna Sai, Aditya Prakash, Giddaluru Veera Venkata Sai, Manjit Kaur, "Loan Delinquency Prediction Using Machine Learning Techniques," International Research Journal of Engineering and Technology (IRJET), Volume 08, Issue 04, April 2021, e-ISSN: 2395-0056.
[2] Kshitiz Gautam, Arun Pratap Singh, Keshav Tyagi, Suresh Kumar, "Loan Prediction using Decision Tree and Random Forest," International Research Journal of Engineering and Technology (IRJET), Volume 07, Issue 08, August 2020.
[3] Ahmad Al-qerem, Ghazi Al-Naymat, Mays Alhasan, "Loan Default Prediction Model Improvement through Comprehensive Preprocessing and Features Selection," 2019 International Arab Conference on Information Technology (ACIT), DOI: 10.1109/ACIT47987.2019.8991084.
[4] Zoran Ereiz, "Predicting Default Loans Using Machine Learning (OptiML)," 2019 27th Telecommunications Forum (TELFOR), 30 January 2020, DOI: 10.1109/TELFOR48224.2019.8971110.
[5] Jian Chen, Ani L. Katchova, Chenxi Zhou, "Agricultural loan delinquency prediction using machine learning methods," International Food and Agribusiness Management Review, May 31, 2021.
[6] Kalyani R. Rawate, P. A. Tijare, "Prediction System For Bank Loan Credibility," Scientific Journal of Impact Factor (SJIF): 4.72, Volume 4, Issue 12, December 2017.
[7] Khaled A. Althelaya, El-Sayed M. El-Alfy, Salahadin Mohammed, "Evaluation of bidirectional LSTM for short- and long-term stock market prediction," 2018 9th International Conference on Information and Communication Systems (ICICS), DOI: 10.1109/IACS.2018.8355458.