Loan Predictor Using Machine Learning for Banks

Amit Rane, Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India, amitrane1545@gmail.com
Shivam Kumar, Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India, shivkc12@gmail.com
Sagar Dipak Vaidya, Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India, sagarvaidya764@gmail.com

Abstract— Even the hard-hitting global COVID-19 pandemic could not stop the financial world from growing. The economies of all countries, from developed to developing, have been growing at a very fast rate, and the availability of credit plays a crucial role in that development. At the same time, credit can put both banks and borrowers in a critically brittle situation: past experience shows that easily available credit is a boon, but used carelessly it produces collections of bad debts and becomes a bane.

Index Terms— ML: Machine Learning, AI: Artificial Intelligence, CNN: Convolutional Neural Network, IoT: Internet of Things, SVM: Support Vector Machine

1. Introduction
One of the world's greatest economic principles holds that the most important requirement of a suitable and efficient economy is the efficient allocation of scarce resources. In a world growing every second, money is the most beneficial of these scarce resources. In the current world of finance, "credit" can be defined as the eligibility to borrow money from a lender. Credit matters most where it is not easily available: good access to credit helps developing countries expand and develop precisely because money there is scarce. At the same time, the history of bad debts has proved that when credit is easy to access it can be used inappropriately and can end up making someone bankrupt.

2. Literature Survey
The published literature covers a wide range of methods, from systems that employ a single algorithm, such as Logistic Regression, XGBoost, or a Decision Tree, to systems that integrate many algorithms. Because of its simplicity, a single algorithm cannot be used in real life on big datasets with many deciding factors (for example, name, age, gender, and spousal income); in a number of published papers such algorithms are used only for exploratory data analysis. To improve the decision-making system, after the initial step of EDA (Exploratory Data Analysis), a mix of decision trees and random forests is used to determine the various nodes and to choose the deciding criteria for loan approval. Feature selection is one of the most important components of the algorithm: only selecting the proper characteristics (age, gender, income, spousal income, credit score) will produce the right outcome. As a result, complete preprocessing of the dataset with suitable feature selection can considerably improve the model and its accuracy. To attain maximum practicality, the algorithm must not be too simple, as this would damage its correctness, nor too complex, as this would impede its utility and, most importantly, its speed.
To achieve our goal, we studied a number of papers written by financial and fintech specialists and concluded that a combination of Logistic Regression, XGBoost, Decision Tree, and Random Forest analysis delivers a very high level of accuracy with no lag in the system's performance.

3. Methodology
A. Loading the Data and Preparing the System
Python is the language used to build the models, as in most machine learning and AI projects, together with libraries such as pandas, seaborn, and sklearn. The data is divided into two files, a train dataset and a test dataset. The train dataset contains both the independent variables and the target variable, while the test dataset contains only the independent variables; the model is asked to predict the target variable for the test data. We also keep copies of the training and test datasets so that the originals are not lost if we make changes.

B. Data Comprehension
The dataset used in this project consists of 12 independent variables, such as gender, marital status, credit history, and education, and one target variable. The variables fall into three categories:
1. Categorical variables: variables whose features are defined in categories.
2. Ordinal variables: categorical variables whose categories have some order or series.
3. Numeric variables: variables that take numerical values and have no inherent order or series.

C. Bivariate Analysis
This analysis is performed by comparing the target variable with each independent variable, using graphs to demonstrate the overlap on two-dimensional plots. Applicants with a large salary should have a greater probability of being sanctioned for a loan, and applicants who have repaid their previous loans should have a better chance of getting a new one. After a loan is sanctioned, the amount is transferred to the applicant's bank account. Approval should also be more likely when the applied loan amount is small, so applicants often choose a smaller EMI to repay in order to raise their chances of approval.

D. Missing Value Treatment
When applicants submit their information, they sometimes omit details such as gender, marital status, dependents, self-employment status, loan amount, loan amount term, and credit history. To estimate and fill in these values, we impute numerical variables with the mean or median and categorical variables with the mode.

E. Outlier Treatment
When working with datasets, we sometimes find input values that are not practically possible. If these outliers are not eliminated during univariate analysis, they distort the mean and standard deviation of the data, so we use outlier treatment to eliminate the unwanted values. The asymmetry that outliers introduce into a distribution is called skewness, and we use a log transformation to eliminate it. The main benefit of the log transformation compared with other methods is that it has very little impact on smaller values but a large impact on larger values; thus the transformed distribution is roughly equal to the normal distribution.
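As a minimal sketch of the preprocessing described above, the snippet below loads the data, keeps copies of the originals, imputes missing values with the mode or median, and applies the log transformation. The file names, column names, and the choice of the median for the loan amount are illustrative assumptions in the style of common public loan datasets, not values fixed by this paper.

```python
import numpy as np
import pandas as pd

# Load the train and test files and keep untouched copies of the originals.
train = pd.read_csv("train.csv")   # independent variables + target variable
test = pd.read_csv("test.csv")     # independent variables only
train_original, test_original = train.copy(), test.copy()

# Missing value treatment: mode for categorical variables,
# median for the numerical LoanAmount column.
for col in ["Gender", "Married", "Dependents", "Self_Employed",
            "Credit_History", "Loan_Amount_Term"]:
    train[col] = train[col].fillna(train[col].mode()[0])
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].median())

# Outlier treatment: the log transformation barely moves small values but
# shrinks large ones, leaving a roughly normal distribution.
train["LoanAmount_log"] = np.log(train["LoanAmount"])
```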
F. Model Building I
For the first model we use logistic regression to predict the target variable. This algorithm classifies on the basis of the independent variables and predicts the output in binary format, i.e., 1 or 0. Because the inputs must be accepted in binary form, it is compulsory to first convert all the variables to 0 or 1. The algorithm evaluates a logistic function, which models the log-odds of the event. We first train the model on the train dataset so that we can predict on the test dataset. For validation, the training data is divided into two parts, one for training and one for validation: the model is fitted on the training part and its predictions are checked on the validation part.

G. Cross-Validation of Logistic Regression Using Stratified k-Folds
Validation tests the model against unseen data: a small part of the dataset is reserved, the model is not trained on it, and this sample is used to test the model just before it is finalized. When we stratify the data, every fold constitutes the whole dataset in organized form, i.e., each fold preserves the class proportions of the full data. This process is called "stratified k-fold cross-validation", and it gives efficient validation with minimal effort.

H. Feature Engineering
Based on our domain expertise, we generate additional features from which the target variable can be predicted. These are the three new features we develop (a code sketch of these features and of the stratified validation follows the description of logistic regression below):
Total Income: the income of the applicant and the co-applicant combined, as indicated in the bivariate analysis of our project. This feature is important because both incomes carry similar importance; combined, they form a higher income, which gives a higher probability of the loan being sanctioned.
EMI: the monthly instalment the applicant must pay to repay the debt. The motive for using this variable is that applicants burdened with a high EMI may face difficulty repaying. The EMI is calculated as the amount of the loan taken divided by the length of the loan tenure.
Balance Income: the income left after the EMI has been paid. We introduce this feature because when a large income remains even after the monthly instalment, the odds of the person repaying the loan are high, which enhances the chances of the loan being approved.

I. Model Building II
After performing feature engineering and adding these extra features, we continue by building Model II using several algorithms: Logistic Regression, Decision Tree, Random Forest, and XGBoost. These are considered in the following steps.

4. Algorithms Used
A. Logistic Regression
In machine learning, logistic regression is used as a classification algorithm: all the independent variables in the dataset are used during training to predict the target variable. Logistic regression predicts the outputs of a categorical dependent variable, so every categorical variable should be converted into binary format before the model is executed, and the predicted outputs take the form 0 or 1 (Yes or No).
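The following sketch makes the feature engineering and the stratified validation concrete, assuming the train DataFrame preprocessed as in the previous snippet and the same illustrative column names. The factor of 1000 reflects the common convention that the loan amount is recorded in thousands, and the "Y"/"N" encoding of the target is likewise an assumption, not something fixed by this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# The three engineered features described in Feature Engineering above.
train["Total_Income"] = train["ApplicantIncome"] + train["CoapplicantIncome"]
train["EMI"] = train["LoanAmount"] / train["Loan_Amount_Term"]
# Assumes LoanAmount is recorded in thousands, hence the factor of 1000.
train["Balance_Income"] = train["Total_Income"] - train["EMI"] * 1000

X = train[["Total_Income", "EMI", "Balance_Income", "Credit_History"]]
y = (train["Loan_Status"] == "Y").astype(int)  # encode the target as 1/0

# Stratified 5-fold cross-validation: every fold preserves the class
# balance of the whole dataset; the mean accuracy is the model score.
scores = []
for fit_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                        random_state=1).split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X.iloc[fit_idx], y.iloc[fit_idx])
    scores.append(accuracy_score(y.iloc[val_idx], model.predict(X.iloc[val_idx])))
print("mean validation accuracy:", np.mean(scores))
```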
B. Decision Tree
A decision tree is a powerful machine learning algorithm used for both classification and prediction. It works according to predefined rules by which a node is split into two or more groups, and this splitting of the data depends entirely on the entropy of the system. Using this algorithm, we decide the root node and obtain the various outputs for the given data; the outputs obtained are known as leaf nodes.

C. Random Forest
After the decision tree algorithm, we apply random forest as an extension of it. A number of decision trees, i.e., weak learners, are combined so that the model predicts accurately. For every learner, the algorithm draws a random sample of rows and variables and builds a decision tree on it, and the aggregated guesses produced by the learners form the concluding result. Our Random Forest model produces very stable and accurate output because its hyperparameters are tuned with a grid search.

D. XGBoost
XGBoost is an implementation of gradient-boosted decision trees. It sets a target outcome for the model so that the error is minimized, and for each case the targeted outcome is based on the gradient of the error. XGBoost is scalable and very accurate. While training, XGBoost adds new trees fitted to the errors of the prior trees and combines them to make the final prediction. Because each new model is added using a gradient descent algorithm, gradient boosting lowers the loss.
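To illustrate the two ensemble methods just described, the sketch below tunes a random forest with a grid search and cross-validates an XGBoost classifier, reusing the X and y built in the previous snippet. The hyperparameter ranges are illustrative assumptions rather than settings fixed by this paper, and the xgboost package is assumed to be installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# Random forest tuned with a grid search; the grid values are only an
# example search space, not the paper's exact settings.
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"max_depth": list(range(3, 10, 2)),
                "n_estimators": list(range(25, 126, 25))},
    cv=5)
grid.fit(X, y)
print("best random forest:", grid.best_params_, "accuracy:", grid.best_score_)

# XGBoost fits each new tree to the gradient of the loss on the errors
# of the previous trees; scored with the same 5-fold scheme.
xgb_scores = cross_val_score(
    XGBClassifier(n_estimators=50, max_depth=4, learning_rate=0.1), X, y, cv=5)
print("XGBoost mean accuracy:", xgb_scores.mean())
```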
5. Results
A. Bivariate Analysis
For this analysis we compare the target variable with the independent variables. To gain a better understanding, the percentage of loans granted and rejected for married and unmarried applicants is compared, as is the approval rate for applicants with and without a co-applicant; in the same way, each remaining categorical independent variable is compared against loan sanction. According to the data, a person with a credit history of 1 has a much better chance of getting a loan.

B. Outlier Treatment
Outliers strongly distort the normal distribution of the data, so it is important to remove all of them: if outliers are present, they produce a false mean, median, and standard deviation, which leads to wrong results and slows the model's decision making. We used a log transformation to eliminate skewness from the data. Its main benefit compared with other methods is that it has very little impact on smaller values but a large impact on larger values, so we end up with a distribution similar to the normal distribution.

C. Feature Engineering
The distribution of Total Income is examined; the EMI feature is built and its distribution is tested; and similarly, the Balance Income feature is developed and its distribution is examined.

D. Results
All the models in our project were trained and validated after exploratory data analysis and feature engineering. The stratified k-folds method was used for validation, with the mean accuracy across all folds taken as the model's accuracy; to obtain optimal accuracy in a short period of time, we used a total of five folds. We conclude that random forest with grid search provides optimal accuracy in a shorter timeframe.

6. Acknowledgement
We are thankful to a number of individuals who contributed to our final-year project and without whose help it would not have been possible. Firstly, we offer our sincere thanks to our project guide, Prof. Shivali Chopra, for her constant and timely help and guidance throughout our preparation. We are grateful to all the project guides for their valuable inputs to our project. We are also grateful to the college authorities and the entire faculty for their support in providing us with the facilities required throughout the semester.

7. Conclusion and Future Scope
In this research paper we have presented the methods, algorithms, models, and feature selection by which a system can precisely select eligible loan applicants, reducing the amount of bad debt and the losses that banks suffer because of defaulted loans. The system was introduced to minimize bank losses and human error with the help of the latest machine learning techniques. By preventing this error and reducing human intervention, we can avoid huge losses to banks and help stabilize them.