Analysis and Assessment of Credit rating model in P2P lending
An instrument to solve information asymmetry between lenders and borrowers

By Yang Yang
B.Sc. Management of Science and Project, University of Science and Technology of China, 2007

SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MANAGEMENT STUDIES AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2015

© 2015 Yang Yang. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: MIT Sloan School of Management, May 8, 2015
Certified by: Christian Catalini, Assistant Professor of Technological Innovation, Entrepreneurship, and Strategic Management, Thesis Supervisor
Accepted by: Michael A. Cusumano, SMR Distinguished Professor of Management, Program Director, M.S. in Management Studies Program, MIT Sloan School of Management

ABSTRACT

Since the establishment of the first P2P lending platform in 2005, the P2P lending industry has been nibbling away at the market share of traditional consumer credit. In 2014, Lending Club and Prosper originated over 7 billion USD in personal loans. By comparison, Citi, one of the biggest traditional banks in the U.S., issued 25.2 billion USD in 2014.
Given the advantages of P2P lending over traditional banks, the market for P2P lending is expected to grow rapidly along with improvements in the platforms' internal systems, external regulation, and greater participation from borrowers and lenders. Although most P2P lending platforms in China first imitated the business models of U.S. or European platforms, they have progressively evolved different business models for legislative, economic or behavioral reasons.

Several findings emerge from analyzing the data from Lending Club and Prosper. First, although both platforms have progressively improved their default rates each year, both currently offer negative returns for investors. Second, if only finished/matured loans are considered, a higher credit score does not lead to lower default risk. Third, on average, a defaulted loan costs investors more than twice as much as the interest return offered to them. Taking this cost matrix into consideration, the optimal data model is not necessarily the one with the highest accuracy but the one with the maximum return. Fourth, the ex post return offered by the platforms is not enough to cover the potential risk facing investors.

Thesis Supervisor: Christian Catalini
Title: Assistant Professor of Technological Innovation, Entrepreneurship, and Strategic Management

PURPOSES OF THIS PAPER

It has been almost 10 years since the first P2P lending platform was founded in the UK. While P2P lending has grown rapidly over the past 10 years, it is still in its infancy compared to the traditional banking industry.
There are over 70 academic papers about P2P lending between 2008 and 2015, written from different perspectives, including analyses of what determines whether a loan is successfully funded by investors, regulations, credit risks, determinants of credit quality and default probability, business models of P2P lending across countries, internal information systems, and literature reviews. Even though a handful of papers studied credit risk using data mining methodologies, most of them focused on explaining the determinants of a loan being successfully funded. Few considered a cost matrix in the model or compared results from Prosper and Lending Club.

P2P lending is a two-sided market. To further boost market growth, P2P lending platforms also need to enhance investors' ability to assess credit risk. By doing so, platforms can offer higher returns and thus attract more investors to participate in lending.

The main purpose of this paper is to identify the key determinants of a loan's default probability and their respective coefficients, and then build the optimal model to predict the loan's status. This model will act as a way to mitigate information asymmetry in P2P lending and gaming behavior by borrowers. In addition, this paper takes a dynamic review of the current development of P2P lending, building on previous literature.

Another motivation for this paper is that the Chinese government has just allowed non-state-owned companies to participate in the personal credit rating business. The public believes this move will be a game changer for the internet finance industry, especially the P2P lending segment. This paper will assess whether a 3rd-party credit rating will help investors prevent adverse selection.

Table of Contents

1. INTRODUCTION
  1.1 DEFINITION OF P2P LENDING
  1.2 HOW P2P LENDING WORKS (LENDING CLUB, PROSPER)
2. MARKET REVIEW OF P2P LENDING
  2.1 MARKET SIZE
  2.2 KEY PLAYERS AND RESPECTIVE MARKETPLACE
  2.3 MARKET OUTLOOK OF P2P LENDING
  2.4 BUSINESS MODELS OF P2P LENDING
3. DATA ANALYSIS AND MODELING
  3.1 INTRODUCTION
  3.2 KEY VARIABLES
    3.2.1 Prosper
    3.2.2 Lending Club
  3.3 DISTRIBUTION OF DATASET
    3.3.1 Prosper
    3.3.2 Lending Club
  3.4 MODEL BUILDING AND INTERPRETATION-LENDING CLUB
    3.4.1 Data Preparation
    3.4.2 Model Building
    3.4.3 Model interpretation
    3.4.4 Robustness Check
  3.5 MODEL BUILDING AND INTERPRETATION-PROSPER
    3.5.1 Data Preparation
    3.5.2 Model Building
    3.5.3 Model interpretation
    3.5.4 Robustness Check
  3.6 COMPARISON OF FINDINGS IN MODEL BUILDING FOR LENDING CLUB AND PROSPER
    3.6.1 Similarities
    3.6.2 Differences
    3.6.3 Lessons for China's P2P Lending
4. CONCLUSION
  4.1 CONCLUSION OF THIS PAPER
  4.2 FURTHER RESEARCH PROPOSED
5. REFERENCES

1. Introduction

Freedman and Zhe Jin (2007) wrote the first academic paper to look into the business of P2P lending. They raised the question of whether P2P lending would reshape the future of the financial industry or would be a fad that wanes over time. Although more than six years have passed since that paper, it is still too early to answer the question. What we do see in the market is the emergence of more P2P lending platforms globally and the IPO of Lending Club in December 2014. The attitude of traditional banks toward this young industry is also evolving. For instance, in early 2014, a Wells Fargo employee told the media that management had sent an internal email asking all Wells Fargo employees not to engage in any P2P lending business.
By contrast, many hedge funds and regional banks are purchasing personal loan products from P2P lending platforms because of their stable and attractive returns. More traditional financial institutions have also opened their own P2P platforms to catch up with the trend.

1.1 Definition of P2P Lending

P2P stands for Peer-to-Peer or Person-to-Person. In P2P lending, platforms act as intermediaries that match lenders with borrowers and handle the transfer of money. P2P lending was first introduced by Zopa in the UK in 2005. At the time of writing, Zopa has originated 713 million GBP and is one of the biggest platforms in the world. The emergence of P2P lending is also a result of applying Web 2.0 to the financial industry. By avoiding the overhead cost and infrastructure of traditional banks, P2P lending platforms can offer lower interest rates to borrowers and accumulate huge traffic within a short period (Dhand et al., 2008).

1.2 How P2P Lending Works (Lending Club, Prosper)

Borrowers apply for personal loans for various reasons. The main purpose of personal loans on Lending Club and Prosper is credit consolidation. A borrower applies for a loan by providing private information such as loan amount, term, credit rating score, debt-to-income ratio, monthly income, occupation and loan purpose. The platform then assesses the information and sets a fixed interest rate for the loan. Once the borrower agrees to the interest rate, the loan is listed on the platform for investors to browse.

Investors can then browse loan information and decide whether to invest and how much. Among the 73 papers on P2P lending between 2008 and 2015, 20 discussed how to increase the likelihood of a loan being successfully funded and what the key determinants are. Compared with unverified variables, verified variables play a much more significant role in the decision to invest in a loan (Gregor et al., 2010).
Also, borrowers who are willing to disclose more information normally pay lower interest rates (Böhme et al., 2010). Social ties increase the chances of having the loan fully funded (Sergio, 2009; Greiner & Wang, 2009; Herrero-Lopez, 2009; Hildebrand & Rocholl, 2010; Lin, 2009), reduce the ex post interest charged on the loan, and also decrease the default risk associated with the loan (Lin et al., 2009; Zhensheng, 2014). Furthermore, some research focuses on how borrowers' demographic information, such as appearance and gender, contributes to loan funding. Research shows that appearance does influence lenders' decisions to fund a loan (Jefferson et al., 2012). Female borrowers are less likely to get loans funded than male borrowers.

Based on all the information provided by the borrower, investors then need to determine whether to lend and how much to lend. The objective of lending money on P2P platforms is to gain a high return while mitigating default risk. Investors on P2P lending platforms are inclined to invest in loans with higher ex post returns, which also carry higher default risk. Assessing default risk based on previous loans' performance is another focus of academic papers. Eight papers built models to investigate the key determinants of default risk, so that investors can use them as a guideline to avoid adverse selection. Loans with lower credit grades and longer terms result in higher default risk (Riza et al., 2015). This finding is opposite to the result in this paper because, rather than using either completed loans or matured/finished loans alone, I used a combination of both. There are discrepancies between the risk premiums charged and the real default risk associated with loans on P2P lending platforms (Kumar, 2007). This conclusion is supported by evidence that the premium charged by P2P platforms is not enough to cover investors' potential losses (Riza et al., 2015).
It has also been recommended that default risk be mitigated by setting up a social reputation system on P2P lending platforms (Everett, 2010; Lin, 2009).

Platforms charge borrowers a loan origination fee once the loan is successfully funded. Investors are also charged a service fee for managing installment payments from borrowers. A handful of papers focused on building the internal information systems of P2P platforms. For instance, Collier (2010) informed practice and theory on developing community reputation, which can reduce information asymmetry on Prosper and mitigate adverse selection. Also, as intermediaries in the financial market, platforms are regulated by both the SEC and the CFPB. Four papers examined the current regulations on P2P lending and drew implications for the further development of regulation specific to P2P lending. A multi-agency regulatory approach to P2P lending should be implemented that imitates the approach applied to regulate traditional lending (Eric et al., 2012).

Borrowers need to make monthly installment payments until the loans reach maturity. If desired, they can also choose to repay all principal ahead of the loan's maturity by paying a service fee. Platforms also provide a trading system for investors who want to sell the loans they hold at a certain discount. This trading system, like an open market, helps platforms provide more flexibility to investors.

However, some loans default in the early stages of installment payments. This causes a huge loss for investors as a whole. Because individual investments are small, investors are disinclined to hire a collection agency to recover their net principal loss (Freedman & Jin, 2008). Further research into after-default management in P2P lending is urgently needed because it can help mitigate investors' net principal loss and improve the risk-adjusted return of platforms as a whole.

2. Market Review of P2P Lending

2.1 Market Size

The potential market size of P2P lending can be measured in both micro and macro ways. The market for P2P lending is mainly the market for unsecured loans, including unsecured personal loans and lines of credit. The total amount of consumer credit in the U.S. as of October 2014 is 3.283 trillion USD, according to the Federal Reserve G.19 release. Per the Federal Reserve E.2 release, the total amount of outstanding business loans ranging from $10,000 to $99,000 is 3.4 billion USD. Summing these two components gives a potential market size for P2P lending of 3.286 trillion USD in the U.S. market alone. Currently, Prosper contributes 2 billion USD in loans and Lending Club contributes 6 billion USD.

In a macro way, we can expand the market to mid-size business loans, since Lending Club also provides business loans of up to 300K USD. The total amount of business loans ranging from 100K to 999K USD is 12 billion USD (Donghon, 2014). Conservatively, we can add another 2.4 billion USD to the potential P2P lending market. This results in a market with a total size of 3.288 trillion USD. Investors on P2P lending platforms are poised to take between 25 percent and 30 percent of the business that traditional banks are doing. The overall market of P2P lending would then grow to about $1 trillion by 2025 (Cromwell, 2015).

2.2 Key Players and Respective Marketplace

Rank  Lending Site  Year Founded  Loan Volume ($ billion)  Country
1     Lending Club  2007          6                        USA
2     CreditEase    2006          3.2                      China
3     Upstart       2012          3                        USA
4     Prosper       2006          2                        USA
5     Zopa          2005          0.8                      UK

Lending Club. Lending Club, founded in 2007, has paid investors $590 million in interest. According to statistics on Lending Club's website, as of September 30, 2014, 83.17% of Lending Club borrowers reported that they use loans from Lending Club to refinance existing loans or pay off their credit cards.
The breakdown of the main purposes of Lending Club loans is shown below.

[Figure: breakdown of the main purposes of Lending Club loans]

Prosper. Prosper, founded by Chris Larsen and John Witchel on February 5, 2006, was the first P2P lending platform in the U.S. It remains unlisted and is financially supported by several big names in venture capital. To date, Prosper has more than 2 million members and has generated over 2 billion USD in loans.

Upstart. Upstart was founded by ex-Googlers in 2012 in the U.S. and has originated more than $3 billion in loans with an annual growth rate of 265%. The major difference between Upstart and other platforms is that, when assessing the credit quality of borrowers, Upstart starts with the same information but further includes academic variables to make the risk assessment more statistically grounded.

CreditEase. As reported by Peter Renton in 2014, CreditEase is the largest P2P lending platform in China and has generated more than 3.2 billion USD in loans to over 500,000 borrowers. The company was founded in 2006 and now operates in over 150 cities in China.

Zopa. Zopa is the oldest peer-to-peer lending company in the world. The company was founded in 2005 in the UK. It has lent 1 billion USD and has helped both borrowers and investors get better rates.

2.3 Market Outlook of P2P Lending

The growth of P2P lending has exceeded the public's expectations in recent years. Gartner (2010) projected that P2P lending would increase by 66% to a total size of 5 billion USD by the end of 2013. Looking at the statistics of the biggest platforms, I found that Lending Club experienced an annual growth rate of over 150% through 2014. Prosper.com has also achieved exponential growth since its establishment. By the end of 2013, it had originated over 300 million USD in loans, and it raised this number to over 1.5 billion USD by the end of 2014.
Although it is extremely difficult to estimate the exact growth rate of P2P lending, several determinants can indicate its future trend from a macro perspective.

1) Geographic expansion. P2P lending is not yet fully authorized in all states of the U.S. due to the complexity of state autonomy. Even in China, the acceptance of P2P lending varies among regions. Further geographic expansion can be expected in the next few years.

2) More comprehensive legislation. The main reason that certain public authorities or groups are still skeptical about P2P lending is that it is still in its infancy and is less regulated than traditional banks. Regulations specific to P2P lending are urgently needed in the market.

3) Challenges from traditional banking. Although P2P lending has a large cost advantage over traditional banks, with the recovery of the U.S. economy the government is considering loosening the requirements for borrowers. This would help traditional banks regain borrowers who currently do not qualify for a loan. In China, many financial institutions have also introduced their own P2P platforms to gain a piece of the pie.

4) Information asymmetry. Information asymmetry might lead investors to adverse selection (Akerlof, 1970) and moral hazard (Stiglitz and Weiss, 1981). Platforms are making various efforts to mitigate information asymmetry.

5) Bottom line of the economy and employment. The performance of both the economy and employment will affect the further development of P2P lending. According to statistics from Prosper and Lending Club, most borrowers' purpose is credit consolidation. A stronger economy and improved wages and employment rates mean that people's financial condition will improve and the need for credit consolidation will decline accordingly.

6) Institutional investors.
P2P lending can provide a higher ROI than many other investments in the financial market. Some institutional investors purchase loan packages from platforms to gain stable cash flow and returns. A simple comparison among different financial investments is listed below. In 2013, P2P lending generated a much lower return than the NYSE and the Dow Jones Industry Composite, but it outperformed both in 2014. However, for the P2P lending platforms I am using the official investment return rate, and the true risk-adjusted investment return might differ from this figure. It is also worth noting that the superior stock market return in 2013 was due to the recovery from an economic and financial downturn. An ROI of around 10% is already very impressive in the financial investment sector. As reported by Bloomberg, the average return of hedge funds was 7.4% in 2013.

Investment  Lending Club  Prosper  3yr T   NYSE    Dow Jones
2014        10.50%        9.79%    1.10%   4.22%   7.52%
2013        8.75%         9.86%    0.78%   23.18%  26.50%

By the end of 2014, the total amount of loans originated through P2P lending in China had reached $40 billion, with a default rate of 17.46%. A total of 1.16 million borrowers had their loans funded by 630,000 investors; these numbers increased by 364% and 320%, respectively, compared with 2013. There are 1575 P2P lending platforms in China, and 275 went bankrupt in 2014, implying that one out of six platforms was not sound. The average loan amount per borrower and the average amount funded per investor are $35,000 USD and $64,000 USD, respectively. These statistics come from Wangdaizhijia.com in China.

2.4 Business Models of P2P Lending

This section introduces the business models used by major P2P lending platforms in the U.S. and China and addresses the major differences between the two markets.

In the U.S. market, the business models of P2P lending platforms are quite similar to each other. Borrowers post their loans on platforms, and investors browse and choose loans to invest in.
The P2P lending platform acts as an intermediary and is responsible for risk rating, determining the interest rate, document verification and interest payment management. However, Prosper and Lending Club still differ in several ways, as described below.

1) Loan type. Prosper only originates personal loans ($2,000-$35,000 USD), while Lending Club also originates business loans of up to $300,000 USD in addition to personal loans ranging from $1,000 to $35,000 USD. Prosper and Lending Club also provide loans with different maturities. Both provide 3-year and 5-year loans. In addition, Lending Club provides a 1-year loan.

2) Interest rate. P2P platforms determine the interest rate by considering information that reflects borrowers' credit quality. Both Prosper and Lending Club stipulate a cap and a floor on the interest rate for loans in each credit rating/grade. However, the interest rate in the same credit category differs between Prosper and Lending Club because of their different credit rating logic.

3) Credit scoring. Prosper and Lending Club each provide a proprietary credit score as a major indicator of loan risk. Both offer 7 rating categories: Prosper from HR (worst) to AA (best) and Lending Club from G (worst) to A (best).

4) Origination fee. Platforms earn money by charging fees to borrowers. The cap and floor fee rates charged by Prosper and Lending Club are the same, whereas different rates are charged to borrowers in different risk categories. A simple comparison is listed below, including credit rating, respective interest rate and origination fee.

Prosper
Rating  Origination Fee  Interest Rate
AA      1%-2%            6.05%-7.96%
A       4%               8.19%-11.33%
B       5%               11.56%-14.06%
C       5%               14.59%-18.27%
D       5%               19%-22.68%
E       5%               23.44%-27.04%
HR      5%               27.75%-31.25%

Lending Club
Rating  Origination Fee  Interest Rate
A       [illegible]-3%   5.49%-8.19%
B       4%-5%            8.67%-11.99%
C       5%               12.39%-14.99%
D       5%               15.59%-17.86%
E       5%               18.54%-21.99%
F       5%               22.99%-25.57%
G       5%               25.8%-26.06%

5) Affiliate & Referral Programs.
Prosper runs an affiliate program to attract more borrowers and lenders through referrers, paying $100-150 USD for borrower leads and $50 USD for lender leads. Lending Club has also introduced an affiliate and referral program, but detailed bonuses are not provided on its website.

6) Notes trading. Both Prosper and Lending Club provide a notes trading platform, where investors can trade the notes they hold with each other. Folio is a broker-dealer platform that charges only sellers a fee of 1%.

7) Early repayment. Borrowers can choose to pay off the remaining balance without any penalty, in order to avoid paying monthly interest in the future.

8) Interest auction. P2P lending platforms normally set the interest rate for loans based on the information provided by the borrowers. However, in its early years, Prosper ran an interest rate auction in which investors could bid the lowest interest rate they would accept in order to compete to fund the most popular loans. This is why some loans were originated with a lower interest rate. Prosper stopped the interest auction service in 2011 and implemented a fixed interest rate like Lending Club.

In China's market, P2P lending platforms basically follow the same model as those in the U.S., acting as intermediaries between borrowers and lenders. However, due to differences in the economic and legal environment, as well as in customer behavior, P2P lending in China has evolved some unique features. We use Hongling Capital and Creditease as representatives since they are two of the earliest P2P platforms that originated in China.

1) Loan type. Hongling Capital offers personal and business loans of between $500 and $1,600,000 USD, with maturities between 3 months and 12 months. Creditease offers personal loans of between $1,600 USD and $1,000,000 USD, with maturities between 1 year and 4 years.
Clearly, P2P lending platforms in China's market are more aggressive and also bear higher default risk.

2) Interest rate. Rather than determining the interest rate based on credit score, maturity and amount as P2P platforms in the U.S. do, China's platforms determine interest rates simply based on loan type or maturity, because there is no credit agency that can provide a comprehensive credit report for individuals (China's PBOC only authorized certificates for credit agencies in January 2015). Hongling Capital sets interest rates between 8% and 18%, and Creditease between 10% and 12.5%.

3) Credit scoring. The only credit report that a borrower can submit is the one provided by the PBOC, which includes the history of credit card usage and loan repayment. Unlike U.S. platforms, platforms in China do not rate borrowers into different credit categories. It is common practice for platforms to award credits to borrowers or investors if they successfully make their monthly payments or make investments. For instance, Hongling Capital sorts customers into 5 categories from V1 (lowest) to V5 (highest). Investors on Hongling Capital can refer to these categories as a risk indicator.

4) Origination fee. Creditease charges investors 10% of interest earnings and borrowers 10% as a service fee. Rates and fees on Hongling Capital are more complex. Hongling Capital charges investors fees of 0% to 10%, depending on their category, which ranges from V1 to V5. For instance, investors in V1 need to pay 10% of interest earnings as a service fee, and those in V5 do not need to pay any service fee. For borrowers, Hongling Capital also charges various percentages of the loan as a service fee, based on the loan type. The overall range is from 3% to 14.6%.

5) Affiliate & Referral Programs. Creditease does not pay a referral bonus, while Hongling Capital pays $6 USD if the referred customer registers as a normal member, and $12 USD if the customer registers as a VIP.

6) Notes trading.
Platforms in China also provide notes trading services to investors.

7) Early repayment. On Creditease, borrowers who want to pay off the remaining loan early must pay, besides the interest for the current month, the remaining loan and the service fee, a penalty of 0.5% of the remaining loan to the platform. Similarly, borrowers on Hongling Capital must pay interest for an extra month as a penalty if they want to pay off the remaining loan early.

8) Principal guarantee. The biggest difference between the U.S. and China in P2P lending is that many platforms in China introduce a third-party company to guarantee the safety of investors' money in case any fraudulent funding happens. This is a remedy for the lack of credit scores for borrowers and platforms, and it is meant to improve investors' confidence. However, a third-party guarantee is not a panacea for P2P lending in China. A guarantee company certificate only costs $1 million USD, and there are cases where owners disappeared with the money, leaving investors to lose all their money.

3. Data Analysis and Modeling

3.1 Introduction

This section addresses the following questions:

1) The distribution of PV, rate of bad loans and interest across credit categories.
2) Whether the risk-return improves from year to year, especially when platforms change their policy.
3) Any behavioral differences between borrowers and investors on Prosper and Lending Club.
4) The contribution of determinant variables to the performance of loans.
5) A model to determine the probability of default using different data mining methodologies.
6) As researched by Riza, Yanbin, Benjamas and Min in 2014, the higher interest rate set by Prosper and Lending Club for riskier loans is not enough to compensate investors for the potential loss they are exposed to. This section will use an FCFF methodology to test this conclusion, considering the time value of future cash flows.
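Question 6 depends on discounting a loan's installments rather than comparing nominal rates. The following is a minimal sketch of that calculation, assuming a standard fully amortizing loan with level monthly payments. The function names, the example loan, and the simplification that cash flows simply stop at a hypothetical default month with no recovery are illustrative assumptions, not details taken from the platforms' data.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Level payment of a fully amortizing loan (standard annuity formula)."""
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)


def present_value(installment, annual_discount_rate, months_paid):
    """Discount the installments actually received back to origination."""
    d = annual_discount_rate / 12.0
    return sum(installment / (1.0 + d) ** t for t in range(1, months_paid + 1))


# Hypothetical loan: $10,000 at 12% a year over 36 months.
payment = monthly_installment(10_000, 0.12, 36)
pv_full = present_value(payment, 0.12, 36)     # every installment is paid
pv_default = present_value(payment, 0.12, 17)  # assumed: payments stop after month 17
```

When the discount rate equals the contract rate, full repayment returns exactly the principal, so `pv_default` shows how much of the amount lent is recovered in present-value terms when payments stop early.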
3.2 Key Variables

3.2.1 Prosper

Variable name | Type | Definition
Credit Rating | Numeric | Proprietary credit rating by P2P lending platforms
Loan Status | Dummy | Whether the loan is active, completed or default
Borrower Rate | Numeric | Interest rate borrower is willing to pay
Borrower APR | Numeric | Actual rate borrower needs to pay considering service cost
Lender Yield | Numeric | Actual rate lenders receive considering service cost
Listing Category | Dummy | The purpose of the loan
Employment Duration | Numeric | The time period of employment till the creation of listing
Is Borrower home owner | Numeric | Whether the borrower owns real estate
Current Credit Line | Numeric | The number of credit lines the borrower owns
OpenRevolvingMonthlyPayment | Numeric | The monthly payment of revolving account
RevolvingCreditBalance | Numeric | The current credit balance of revolving account
BankcardUtilization | Numeric | The percentage utilization of revolving credit balance
AvailableBankcardCredit | Numeric | The total amount of bank card credit till the creation of the loan
TradesNeverDelinquent | Numeric | The percentage of delinquency of trades
DebtToIncomeRatio | Numeric | The percentage of debt to income
StatedMonthlyIncome | Numeric | Monthly income stated by borrowers
LoanOriginalAmount | Numeric | The original amount of loan originated
Investors | Numeric | The number of investors who fund the loan
Terms | Numeric | The term length of the loan

Both Prosper and Lending Club define "bad loans" as loans that are 60+ days past due within the first twelve months from the date of loan origination.
3.2.2 Lending Club

Variable | Type | Definition
grade | Dummy | The proprietary credit rating of Lending Club
loan_status | Dummy | The current status of the loan
int_rate | Numeric | The interest rate the borrower needs to pay
purpose | Dummy | The purpose of the loan
emp_length | Numeric | The time length of the employment of the borrower
home_ownership | Dummy | If the borrower owns or rents an apartment
open_acc | Numeric | The number of open credit lines of the borrower
revol_bal | Numeric | The amount of current credit balance
revol_util | Numeric | The current ratio of credit balance utilization
dti | Numeric | The debt to income ratio
annual_inc | Numeric | The amount of annual income
loan_amnt | Numeric | The amount of the loan
installment | Numeric | The amount of monthly payment
term | Numeric | The term length of the loan

3.3 Distribution of Dataset

3.3.1 Prosper

When describing the distribution of loan characteristics, we exclude current listings, which have not completed, and cancelled listings, which were never funded. Records with the proprietary credit rating "NC" are also excluded due to incomplete information; those loans were originated in 2006 and early 2007, when Prosper was in its infancy. There are 113 records that are missing a proprietary credit rating. Because this number is small, we assume that these records do not affect the validity of our analysis.
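The exclusion rules above, followed by a per-category default rate, can be sketched as below. This is only an illustration: the field names (`rating`, `status`), the status labels and the sample records are hypothetical, and dropping records with a missing rating is an assumption about how the 113 unrated records are handled.

```python
from collections import defaultdict

# Hypothetical field names and status labels; the real export may differ.
EXCLUDED_STATUSES = {"Current", "Cancelled"}
DEFAULT_STATUSES = {"Defaulted", "Chargedoff"}


def keep_record(rec):
    """Apply the exclusions: current/cancelled listings, 'NC' or missing ratings."""
    if rec["status"] in EXCLUDED_STATUSES:
        return False
    if rec["rating"] in (None, "", "NC"):
        return False
    return True


def default_rate_by_rating(records):
    """Share of defaulted loans among the kept records of each rating."""
    counts = defaultdict(lambda: [0, 0])  # rating -> [defaults, total]
    for rec in filter(keep_record, records):
        counts[rec["rating"]][1] += 1
        if rec["status"] in DEFAULT_STATUSES:
            counts[rec["rating"]][0] += 1
    return {rating: d / n for rating, (d, n) in counts.items()}


# Made-up records, for illustration only.
sample = [
    {"rating": "AA", "status": "Completed"},
    {"rating": "AA", "status": "Defaulted"},
    {"rating": "HR", "status": "Chargedoff"},
    {"rating": "NC", "status": "Completed"},
    {"rating": "B", "status": "Current"},
    {"rating": None, "status": "Completed"},
]
rates = default_rate_by_rating(sample)
```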
Credit Category | Successful Rate | Number of Loans | Total Amount | Average Amount | Median Amount | STDEV | Default Rate
AA | 30% | 6,487 | 61,402,940 | 9,466 | 12,000 | 6,664 | 11%
A | 23% | 10,479 | 101,490,254 | 9,685 | 11,000 | 6,664 | 16%
B | 25% | 12,023 | 117,411,802 | 9,764 | 12,000 | 8,345 | 22%
C | 29% | 14,892 | 125,436,437 | 8,423 | 10,000 | 7,044 | 28%
D | 47% | 15,259 | 96,539,254 | 6,326 | 7,500 | 5,853 | 31%
E | 49% | 10,286 | 43,717,649 | 4,250 | 4,000 | 2,629 | 37%
HR | 76% | 8,846 | 27,031,067 | 3,056 | 3,500 | 1,323 | 46%

Credit Category | Term: 1 year | Term: 3 years | Term: 5 years | Interest rate | $/investor | Credit Score
AA | 3% | 93% | 4% | 8.9% | 53 | 791
A | 3% | 86% | 12% | 11.4% | 73 | 738
B | 3% | 79% | 19% | 15.4% | 87 | 712
C | 2% | 76% | 22% | 18.9% | 104 | 682
D | 2% | 83% | 15% | 23.6% | 91 | 667
E | 3% | 87% | 10% | 28.3% | 103 | 640
HR | 0% | 100% | 0% | 29.3% | 89 | 621

There are several features of the dataset distribution of Prosper.

1) Surprisingly, the rate at which a listing is successfully funded as a loan increases as credit worsens. This might be caused by the higher interest rate paid on worse credit ratings.
2) The majority of loans are from C and D, consistent with our expectation that most loans on Prosper (and on most P2P lending platforms) come from borrowers with a poor credit record.
3) From the best credit rating to the worst, the average and median loan amounts decline almost continuously, mainly because of the limits placed by P2P platforms.
4) The default rate climbs as credit gets worse. The default rate of AA loans is 11%, compared with 46% for HR loans.
5) As expected, the interest rate increases as credit quality declines. An assessment in a following section tests whether the interest rate advised by Prosper is enough to cover the potential loss.
6) For loans with a poor credit rating, investors tend to place more money on each investment.

[Figure: Number of loans and average loan amount by Prosper credit category (AA to HR)]

[Figure: Borrower Rate vs. Prosper Rating, with smoothed borrower rate, for ratings AA to HR]

Percentage of Total Loans by amount

Year | AA | A | B | C | D | E | HR | Default Rate
2006 | 7.5% | 7.7% | 9.3% | 11.2% | 9.8% | 8.9% | 45.6% | 39.2%
2007 | 15.4% | 16.8% | 19.9% | 21.3% | 15.3% | 6.2% | 5.2% | 39.5%
2008 | 23.3% | 19.5% | 23.2% | 17.4% | 11.2% | 3.0% | 2.5% | 33.0%
2009 | 21.6% | 24.9% | 6.9% | 17.9% | 13.6% | 5.2% | 9.8% | 15.2%
2010 | 16.1% | 20.9% | 14.7% | 9.0% | 19.5% | 8.3% | 11.5% | 16.7%
2011 | 7.3% | 17.5% | 16.9% | 9.1% | 27.1% | 16.7% | 5.5% | 22.6%
2012 | 7.3% | 17.7% | 18.1% | 22.8% | 18.9% | 5.4% | 9.9% | 31.2%
2013 | 4.6% | 16.5% | 24.5% | 31.2% | 14.9% | 6.8% | 1.5% | 23.6%
2014 | 6.8% | 18.3% | 24.0% | 29.4% | 1.4% | 6.5% | 13.6% | 24.5%

7) Year by year, more investors switch from the A or AA classes to riskier loans, especially to loans in B and C. This trend might be caused by investors seeking a higher interest rate as well as by the improved default rate within each credit category.
8) Both the overall default rate and the default rate for each credit category decreased over time, even though investors are becoming more risk-averse. This improvement can be explained by Prosper's efforts to improve its risk screening and verification.
(When calculating the default rate, loans originated after Q2 2014 are excluded from the dataset, because such loans could not yet be more than 60 days past due, which is the point at which a loan is considered in default.)

Default rate YoY

Year | AA | A | B | C | D | E | HR | Overall
2006 | 8.8% | 16.7% | 24.7% | 36.2% | 35.8% | 48.8% | 64.8% | 39.2%
2007 | 14.3% | 25.8% | 33.3% | 41.1% | 42.8% | 53.2% | 62.2% | 39.5%
2008 | 18.3% | 25.6% | 32.9% | 33.4% | 37.4% | 43.6% | 52.5% | 33.0%
2009 | 6.0% | 9.3% | 16.8% | 15.4% | 22.4% | 22.3% | 23.7% | 15.2%
2010 | 3.9% | 9.8% | 11.2% | 15.3% | 21.4% | 24.9% | 25.4% | 16.7%
2011 | 2.9% | 9.4% | 15.5% | 14.9% | 24.8% | 32.1% | 31.0% | 22.6%
2012 | 8.1% | 9.3% | 14.1% | 20.1% | 23.9% | 25.9% | 28.5% | 31.2%
2013 | 4.1% | 2.8% | 4.6% | 7.5% | 10.8% | 13.1% | 13.6% | 23.6%
2014 | 8.7% | 0.4% | 0.7% | 1.2% | 1.6% | 2.5% | 1.7% | 24.5%

3.3.2 Lending Club

Credit Category | Successful Rate | Number of Loans | Total Amount | Average Amount | STDEV | Default Rate
A | 32.6% | 20,076 | 213,245,525 | 10,622 | 6,586 | 8.5%
B | 28.8% | 33,882 | 402,115,200 | 11,868 | 6,861 | 17.2%
C | 26.7% | 27,641 | 352,094,900 | 12,738 | 7,769 | 24.2%
D | 28.3% | 17,980 | 246,222,500 | 13,694 | 8,426 | 30.8%
E | 29.1% | 8,484 | 148,964,150 | 17,558 | 9,505 | 36.4%
F | 33.6% | 3,772 | 73,021,450 | 19,359 | 9,225 | 43.5%
G | 33.6% | 916 | 20,171,950 | 22,022 | 8,417 | 43.2%

1) On Lending Club, there is no significant difference across credit categories in the rate at which listings are successfully funded.
2) Loans are concentrated in the better credit grades, from A to D, in terms of both number of loans and total amount.
3) Unlike on Prosper, lower-credit loans on LC tend to be larger than higher-credit loans. This indicates that LC considers the loan amount as a factor when rating loans.
4) There is no significant shift in investors' risk aversion from year to year on Lending Club.
5) The default rate of LC is much lower than that of Prosper in each year and under each category, but this does not mean that the overall risk-return that Lending Club generates is higher than that of Prosper as a whole. More detail is given in the following sections.
6) Interest rates for loans of the same credit rank on LC and Prosper are similar.
7) The default rate shows a trend of improvement from 2007 to 2010. I do not take years after 2011 into consideration, since most of those loans are still in the regular payment process, whereas most loans originated in the early years have either been fully paid or gone into default.

Percentage of Loans by credit grade-LC

Year | A | B | C | D | E | F | G
2007 | 22.7% | 24.3% | 29.9% | 14.7% | 5.6% | 2.8% | 0.0%
2008 | 18.9% | 32.5% | 28.0% | 14.2% | 4.8% | 1.3% | 0.3%
2009 | 25.0% | 28.9% | 25.3% | 13.9% | 5.0% | 1.4% | 0.5%
2010 | 24.3% | 30.7% | 21.4% | 14.0% | 6.9% | 2.1% | 0.8%
2011 | 26.5% | 30.2% | 18.1% | 12.9% | 8.0% | 3.3% | 0.9%
2012 | 20.4% | 34.7% | 22.3% | 13.7% | 6.0% | 2.5% | 0.5%
2013 | 13.1% | 32.7% | 28.3% | 15.3% | 6.7% | 3.3% | 0.6%
2014 | 14.2% | 26.6% | 28.1% | 18.9% | 8.7% | 2.8% | 0.8%

Default Rate YoY-LC

Year | A | B | C | D | E | F | G | Overall
2007 | 1.8% | 13.1% | 18.7% | 40.5% | 35.7% | 28.6% | 0.0% | 17.9%
2008 | 5.8% | 14.6% | 17.8% | 24.3% | 16.0% | 47.6% | 50.0% | 15.8%
2009 | 6.7% | 11.4% | 14.8% | 17.4% | 21.6% | 17.2% | 34.8% | 12.6%
2010 | 4.7% | 11.1% | 14.5% | 18.6% | 22.5% | 30.0% | 28.4% | 12.6%
2011 | 6.6% | 11.5% | 16.8% | 20.9% | 23.8% | 28.1% | 31.5% | 14.1%
2012 | 6.3% | 11.0% | 15.1% | 19.1% | 23.4% | 25.6% | 30.7% | 13.2%
2013 | 1.7% | 4.4% | 7.4% | 10.8% | 12.8% | 17.0% | 16.6% | 6.9%
2014 | 0.5% | 1.1% | 1.8% | 2.8% | 3.8% | 5.8% | 5.8% | 1.9%

[Figure: Number of Loans by Risk Category, showing number of loans and average amount for credit grades A to G]

[Figure: Interest Rate Range by Risk Category, by credit grade]

3.4 Model Building and Interpretation: Lending Club

This section contains five steps. First, prune the datasets of Lending Club and Prosper for the model building. Second, select variables and build the logistic model to predict the default probability. Third, interpret the significance of each variable and compare the estimates with expectations.
Fourth, choose alternative data models to predict the loan status, as well as the net profit/loss, and compare the results with the conclusions drawn from the logistic regression. Last, as a robustness check, I will test the assumption of linearity between the predicting variables and the target prediction, and explore the nonlinear relationship between the target prediction and each individual predicting variable.

3.4.1 Data Preparation

In the data preparation, I tried to incorporate only parameters that can be verified to some degree. Some variables, such as loan purpose, can be fabricated by borrowers. Even though we could build a model with good performance using those subjective parameters, the reliability of such a model would be questionable.

1) Homeownership. The original options for this variable include "rent", "own", "Mortgage", "None" and "Other". We create a dummy variable, coded 1 for "own" or "Mortgage" and 0 for the rest.

2) There are over 300,000 rows of data; all current listings are excluded from the dataset since we are aiming to detect indicators of risk from an investor's perspective.

3) Loan status. Loan status is the target to predict. A loan status of "0" represents loans that have finished all payments or are still in the payment process. "1" represents bad loans, including charged-off loans, defaults, and delinquencies of more than 31 days (there are only two categories for delinquent loans: 30 days or less, and more than 31 days). Initially, there are 87880 "completed" loans listed on Lending Club, while my interest is in loans that have either finished all payments or already been declared in default. With that in mind, I further split completed loans into two categories: paid and in-process. Within completed loans, only 5509 loans have already finished all payments. The remaining 82371 completed loans are still in the payment process.
However, as shown in the graph below, 50% of bad loans declared default before the Ih month, and 75% of bad loans declared default before the 17th month. This implies that many of the 82371 loans that have not finished all payments have a great chance of eventually paying off all installments. Therefore, in order to provide a reliable data model and mitigate the bias toward completed loans, I treat completed loans that have paid at least 17 installments as finished loans, and assume that they will not default in the future. By doing this, I get 38555 good loans (finished all payments) and 24871 bad loans (default or charged off).

[Figure: No. of Months Paid vs. loan_status]

4) Income verified. "0" means that the income is not verified, while "1" means that the income is verified.

5) Independent variables involved in the regression: loan amount, term, employment length, homeownership, annual income, whether the income is verified, debt-to-income ratio, FICO credit score, open accounts, revolving credit balance, the utilization ratio of the revolving credit balance, and total accounts. I excluded the variable "purpose" from the model due to the low reliability of the value that borrowers enter when they apply for the loan.

6) Training and validation. The whole dataset is randomly partitioned into 43426 training rows and 20000 validation rows.

7) Profit/cost matrix. I need a cutoff value in order to classify the predictions into 0 or 1. To find it, I first need to compute the profit/cost matrix for Lending Club. There are 63426 loans in the dataset, including 38555 good loans and 24871 bad loans. Good loans generate $108,339,408 on a total original amount of $450,364,975, representing an ROI of 24.1%. Bad loans cost investors a total loss of $219,172,141 on a total original amount of $350,771,625, representing a negative ROI of 62.5%.
Finished loans as a whole cause a loss of $110,832,732 on the total amount of $801,136,600, representing a negative ROI of 13.8%. The real ROI that Lending Club offers to investors is therefore much lower than the one it advertises on its website. Since the loss rate on a bad loan (62.5%) is about 2.6 times the return rate on a good loan (24.1%), the profit/cost matrix should be as below.

Profit Matrix (rows: actual loan status; columns: predicted loan status)

Actual \ Predicted | 0 | 1
0 | 1 | -1
1 | -2.6 | 0

3.4.2 Model Building

Before building the model, I selected variables based on the R-Square, AIC and BIC rules. Then I compared the performance of models using the different variable combinations.

1) R-Square oriented stepwise selection removes open_acc from the model.
2) Minimum AIC recommends further removing home_ownership from the model.
3) Minimum BIC gives the same result as minimum AIC, excluding open_acc and home_ownership from the model.

Detailed results are listed below. The significance probabilities are identical under the three rules; only the set of entered parameters differs.

Parameter | Sig Prob | Entered (Maximize RSquare) | Entered (Minimum AIC) | Entered (Minimum BIC)
Intercept[1] | 1 | X | X | X
loan_amnt | 8.30E-70 | X | X | X
term | 3.00E-233 | X | X | X
emp_length | 5.00E-15 | X | X | X
home_ownership | 0.51441 | X | |
annual_inc | 1.30E-41 | X | X | X
is_inc_v | 6.81E-09 | X | X | X
dti | 3.20E-84 | X | X | X
FICOScore | 0 | X | X | X
open_acc | 0.88003 | | |
revol_bal | 3.76E-09 | X | X | X
revol_util | 4.57E-06 | X | X | X

Based on the result of the variable selection, I ran the logistic regression. Estimates of parameters under the two slightly different variable combinations are listed below. There is no significant difference in value or sign between the two results.
Besides, the RSquare-oriented variable combination offers an RSquare of 0.2135, while the AIC/BIC-selected variable combination gives only a slightly lower RSquare of 0.2134.

Term | Estimate (Maximize RSquare) | Estimate (Minimum AIC/BIC)
Intercept | -10.66162 | -10.67306
loan_amnt | -0.00003 | -0.00003
term | -0.03942 | -0.03937
emp_length | -0.02573 | -0.02533
home_ownership | 0.01513 | N/A
annual_inc | 0.00001 | 0.00001
is_inc_v | -0.13985 | -0.13967
dti | -0.03298 | -0.03296
FICOScore | 0.01900 | 0.01902
revol_bal | 0.00001 | 0.00001
revol_util | 0.21735 | 0.21590

Since the model using the parameters selected by RSquare stepwise offers a slightly better result, I computed the formula accordingly:

$$P(\text{Default}) = \frac{1}{1 + e^{-\left(-10.66162 + \sum_i \beta_i X_i\right)}}$$

where $\beta_i$ is the coefficient of parameter $i$ and $X_i$ is the value of parameter $i$.

The confusion matrices generated from the two combinations are listed below. Both models achieve their best performance under a cutoff value of 0.44, meaning that if the default probability is equal to or greater than 0.44, the loan is classified as a default, and otherwise as a good loan. The overall accuracy rates of the two combinations are close: 69.1% for the RSquare combination and 68.8% for AIC/BIC. The former does a better job of identifying good loans, while the latter is more accurate in identifying bad ones. Both combinations improve the overall ROI of Lending Club, to negative 1.2% with the AIC/BIC combination and to negative 1.7% with the RSquare combination. Even though the risk-return after this enhancement is still negative, progress has been made by mitigating a loss of about 12%. Not surprisingly, there is a price to pay for improving the overall risk-adjusted return to investors. Applying this model means that the overall volume of loan origination will decline by 37.8%. However, the improvement in risk-adjusted return can help build the credibility of P2P platforms and attract more investors, and thus more borrowers, in the long run.
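The way a cutoff interacts with the profit matrix can be sketched as follows. This is an illustrative sketch, not the procedure used to obtain the 0.44 cutoff: it assumes the profit matrix is read as +1 for funding a good loan, -1 for rejecting a good loan, -2.6 for funding a bad loan and 0 for rejecting a bad loan, and the scores and outcomes in `probs` and `actuals` are made-up toy values.

```python
# Payoff per loan, keyed by (actual status, predicted status); 1 = default.
PROFIT = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -2.6, (1, 1): 0.0}


def total_profit(probs, actuals, cutoff):
    """Sum the matrix payoffs when loans with P(default) >= cutoff are flagged as 1."""
    total = 0.0
    for p, actual in zip(probs, actuals):
        predicted = 1 if p >= cutoff else 0
        total += PROFIT[(actual, predicted)]
    return total


def best_cutoff(probs, actuals, grid=None):
    """Return the first cutoff on the grid that maximizes total profit."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda c: total_profit(probs, actuals, c))


# Hypothetical predicted default probabilities and observed outcomes.
probs = [0.10, 0.20, 0.35, 0.50, 0.65, 0.80]
actuals = [0, 0, 0, 1, 0, 1]

profit_at_044 = total_profit(probs, actuals, 0.44)
chosen = best_cutoff(probs, actuals)
```

Under this reading of the matrix, funding a loan with default probability p is worth (1 - p) - 2.6p and rejecting it is worth -(1 - p), so for well-calibrated probabilities the break-even point is p = 2 / 4.6, or about 0.435, which is close to the 0.44 cutoff reported above.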
Confusion Matrix-RSquare (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 9180 | 3256
1 | 2923 | 4621

Confusion Matrix-AIC/BIC (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 8959 | 3099
1 | 3144 | 4778

3.4.3 Model Interpretation

In this section, I analyze the parameter estimates obtained in model building and compare the results with commonly held business intuitions. For clarity, when a parameter is said to have a positive impact on the default rate, it means that the higher the value of the parameter, the higher the default probability of the loan, and vice versa.

Several papers have also tried to interpret the impact of parameters. FICOScore has a negative impact on the default rate, while debt-to-income ratio and credit line utilization have a positive impact (Riza, Yanbin, Benjamas and Min, 2015). However, in the results from the model that only includes finished loans, some of the estimates are not intuitive. This section starts by interpreting the variables that run counter to expectation, and then goes through those that match it.

1) "loan_amnt" has a negative impact on the default probability. Normally, a higher loan amount gives the impression of higher risk, but this turns out not to be the case.

2) The same applies to "term". Two term lengths are allowed on Lending Club: 36 and 60 months. Holding all other features constant, a 60-month loan does not carry a higher default risk than a 36-month loan. This might be explained by Lending Club only approving a longer-term loan if the borrower is more qualified.

3) "home_ownership". Owning real estate does not necessarily mean that a borrower is more creditworthy. It is actually the opposite.

4) "annual_inc". A higher income entered by the borrower when applying for a loan does not guarantee a better outcome. The impact of this variable should be considered together with "is_inc_v", which has a negative impact on the default rate.
5) "dti" (debt-to-income ratio). This ratio also has a negative impact on the default rate. This could be explained by some of the income information provided by borrowers being fictitious. Further research in this paper will include only loans with verified income to see whether the result differs.

6) The most surprising finding is that "FICOScore" has a positive impact on the default rate. One might expect borrowers with a higher FICOScore to have better credit quality, since a credit score backed by a third-party agency is normally very reliable. However, on Lending Club (and, as discussed later, in Prosper's model), FICOScore is not a good indicator of credit quality. Lenders cannot simply base their decision on this score, which is what many investors actually do.

7) "revol_util" and "revol_bal" have a positive impact on the default rate, which is consistent with expectation. Because the majority of borrowers on Lending Club apply for loans to consolidate personal credit lines, a higher balance and utilization ratio indicate higher financial pressure in paying back the balance.

3.4.4 Robustness Check

Besides building the model to predict the nominal target parameter, I also considered using the same predicting variables to predict a numeric parameter, net profit/loss, to check whether numeric regression outperforms logistic regression. As in the previous section, I prune the predicting variable combination using RSquare, AIC and BIC, and list the result below. The three selection rules give us the same result: keep all variables in the linear regression model.
Entered | Parameter | Estimate
X | Intercept | -13687.535
X | loan_amnt | -0.1754355
X | term | -106.27022
X | emp_length | -44.95143
X | annual_inc | 0.00523282
X | is_inc_v | -239.96839
X | dti | -96.287356
X | FICOScore | 27.2358601
X | revol_bal | 0.01584572
X | revol_util | 1126.27626

The estimates in the linear regression make more intuitive sense than the results from the logistic regression. For instance, "loan_amnt", "term" and "dti" have negative coefficients on net profit, in the sense that the higher the value of the variable, the lower the profit or the higher the loss that the loan causes investors. By contrast, FICOScore and annual_inc contribute positively to the loan's net profit/loss. The model generates an RSquare of 0.1072, which is significantly lower than the value from the logistic model.

To further test which model is superior, I also draw the confusion matrix for the linear regression model by setting a profit/loss value as the cutoff between good and bad loans. Under a cutoff value of net profit/loss of negative $2,100, the model achieves its highest accuracy of 67%, which can be broken down into 74% accuracy in identifying good loans and 55% accuracy in identifying bad loans. However, the performance of this model is still worse than that of the logistic model.

Confusion Matrix-RSquare (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 9152 | 3422
1 | 3146 | 4258

The different coefficients of the same parameter for default probability and for net profit can be understood in two ways. First, the amount of net loss significantly outweighs that of net profit; therefore the positive impact of FICOScore or annual_inc cannot bring enough profit to push the net P/L to positive numbers. Second, it is nevertheless true that a higher FICOScore and annual_inc can reduce the net loss if loans default, and can also increase the positive return if loans prove to be good.

I also used discriminant analysis and a neural network to classify good and bad loans, and obtained the confusion matrices listed below.
Both models outperform the logistic model in overall accuracy and in net profit when the cost matrix is applied to the results below. The overall accuracy of the discriminant model is 68%, with a breakdown of 70% accuracy for good loans and 65% for bad loans. Using the neural network, the accuracy turns out to be 69%, with 76% accuracy for good loans and 59% for bad ones. However, discriminant analysis and neural networks have two key disadvantages. One is that the structure of the model is not transparent and the user cannot interpret the importance of each parameter, so investors cannot easily apply the model when making investment decisions. The other is that both models need to be updated dynamically whenever new data enters the original dataset.

Confusion Matrix-Discriminant (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 8608 | 2680
1 | 3690 | 5000

Confusion Matrix-Neural Network (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 10268 | 3932
1 | 2030 | 3748

Another key robustness check is to test the assumption of a linear relationship between each predicting parameter and the target prediction (P/L). To do that, I test for the optimal functional form of each predictor, using RSquare as the rule for judging the best fit. Detailed results are listed below.

Predictor | Formula | RSquare
loan_amnt | Quintic | 0.0567
term | Linear | 0.053
emp_length | Quintic | 0.0043
home_ownership | Linear | 0.0005
annual_inc | Logistic 3P | 0.0019
is_inc_v | Linear | 0.0172
dti | Quintic | 0.0244
FICOScore | Quartic | 0.0169
open_acc | Quartic | 0.0077
revol_bal | Quintic | 0.0082
revol_util | Logistic 3P | 0.0055

Having obtained the formula for each parameter, I first return to the linear regression model, using the newly transformed parameters together to predict net profit/loss, and compare it with the previous linear regression model to check whether the performance improves. Coefficients and selection results are listed below.
Entered | Parameter | Estimate
X | Intercept | 2103.486
X | Loanamnt | 0.777
X | Term | 0.612
X | Emplength | 0.404
X | Annual income | -0.555
X | dti 2 | 0.778
X | FICOScore 2 | 1.021
  | open acc 2 | 0.000
X | Revolbal | -0.124
X | Revoluti | -0.687
X | homeownership | 174.240
X | isincv | -96.392

Using the newly transformed parameters, the linear regression model achieves its best performance under a cutoff value of negative $2,300, meaning that loans with a predicted loss equal to or greater than $2,300 are marked as defaults, and otherwise as good loans. The overall accuracy is 68.4%, with 80% accuracy in identifying good loans and 50% accuracy in identifying bad loans. This model outperforms the previous linear model by a small margin, but it is still not as good as the discriminant model or the neural network.

Confusion Matrix-RSquare (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 9828 | 3843
1 | 2470 | 3837

I further used the newly transformed parameters to predict loan status with logistic regression, and obtained the result below. This model outperforms the previous logistic model by 16%, with an overall accuracy of 70%. I have not gone further to test the discriminant model and the neural network with the new parameters, but a reasonable guess is that the performance of these two models would also improve. Testing alternative formulas and data structures shows the necessity of investigating the nonlinear relationship between the predictors and the target parameter.

Confusion Matrix (rows: predicted loan status; columns: actual loan status)

Predicted \ Actual | 0 | 1
0 | 9609 | 3377
1 | 2689 | 4303

Risk Premium by Lending Club

Another important topic is to assess whether the interest rate charged to borrowers on Lending Club is enough to compensate investors for their potential loss. To do this, I used IRR and FCFF to compute the rate and PV for each loan (including current loans).
When using IRR, we also need to estimate the number of installments that an investor can expect to receive on average, and then combine it with the probability of default for each loan. For FCFF, it is important to find the proper discount rate for each loan. Because loans differ from one another, the discount rate that needs to be used also differs from loan to loan: the higher the risk, the higher the discount rate should be. I used the interest rate computed from the regression model as the discount rate, together with the expected number of installments that the investor can receive.

3.5 Model Building and Interpretation: Prosper

3.5.1 Data Preparation

There are 230448 rows in Prosper's dataset, including 151903 current loans; the rest are either completed or defaulted loans. Even though the data layout of Prosper is slightly different from that of Lending Club, the major variables are still available for logistic regression, and I will compare features between Lending Club and Prosper at the end of this section. We will build the prediction model for Prosper following the same rules as for Lending Club, and interpret and visualize what we conclude from the model.

1) We again take out all current loans and look only at the completed and defaulted loans of Prosper. In addition, I checked whether each loan has completed its payment term by cross-checking the columns "month since loan origination" and "term". Since the purpose is to analyze only finished loans, I also excluded all loans with a status of "completed", "past due <15 days" or "15 days<past due<30 days" that have not paid the installment for the 15th month. The detailed rationale is explained in the model building section.

2) When filtering data, we also exclude variables that are missing input for many records. For example, the variable "occupation" has over 8000 rows of missing data. To preserve integrity and validity, it is better to exclude this variable from our model.

3) 649 out of 78544 loans don't have a credit score.
However, the credit record from the third-party agency is one of the most critical variables in this paper. In order to maintain the distribution of the credit score across all completed loans, I assign credit scores to these loans by following the overall distribution.

4) FICOScore. As with Lending Club, the credit score recorded on Prosper is given as a range. In order to build the model, we need to convert the range into a single number. We use the mean of the range as the independent variable in this section.

5) Credit line. There are three credit line related variables: current credit line, open credit line and credit lines in the past 7 years. Due to the large number of missing values for the current credit line and the open credit line, I only use "credit lines in the past 7 years" as a numeric variable.

6) An obstacle to model building for Prosper lies in the missing data for the bankcard utilization rate. Since a major purpose of borrowers applying for loans on Prosper is credit consolidation, knowing the credit balance utilization is critical to predicting the default rate. However, 10% of records are missing the value for the bankcard utilization percentage. There are also many missing values for the variable "debt-to-income ratio". For missing values of "debt-to-income ratio", I use the monthly payment required by Prosper divided by the stated monthly income. For missing values of "bank card utilization", I replace blanks with "monthly revolving payment" divided by "monthly stated income", which gives a reasonable approximation of the real bankcard utilization ratio. However, there are 44 records with a stated monthly income of "0", and I assign a bankcard utilization ratio of "1" to those records.

7) Loan status. I mark completed trades as 0 and default/charged-off loans as 1. "Completed", "Past due (1-15 days)", "Past Due (16-30 days)", "Past Due (31-60 days)" and "FinalPaymentInProgress" are marked as 0. All others are marked as 1.
8) Another variable that caught my attention is "stated monthly income". There are obvious outliers in this variable: some borrowers state a monthly income higher than $15000 (some even as high as $60k per month), while others state a monthly income of less than 1 dollar. Eventually, this variable was excluded from my model.
9) Debt-to-income ratio. This variable is the ratio of the monthly installment to income. Looking at its distribution, there are also outliers, such as 650 records with a ratio higher than 1. It makes no sense that a borrower could make a monthly payment that is higher than his or her income. Moreover, such a ratio should be a high-risk indicator, and the platform shouldn't have approved loan applications from those borrowers in the first place. Those outlier records were also excluded from my model.
10) LP net principal loss. This column records the net loss if the loan goes into default. Among the default/charged-off loans, there are 42 records with a negative net loss, indicating that there was actually no net loss thanks to payments collected afterwards. Therefore, I replace the negative amounts with "0". There are also loans "past due over 60 days" for which no net principal loss is recorded because they haven't been officially marked as default loans. In my model, I treat those loans past due over 60 days as defaults, and thus calculate the net principal loss as the difference between the loan origination amount and the customer's payments.
11) After pruning the dataset in the ways listed above, I get 53637 rows of records in total, including only loans that either finished all payments or were declared default or charged off.
12) Interestingly, there are 200 loans that originated after the 3rd quarter of 2014 and are already marked as either default or past due over 61 days.
13) In order to avoid over-fitting, I partitioned the whole dataset into 33637 rows of training data and 20000 rows of validation data. The optimal model is selected to maximize the prediction accuracy on the validation data.
14) Cost/Profit Matrix. In order to compute the cost/profit matrix for identifying bad loans and good loans, I first need to calculate the total return and ROI of good and bad loans respectively. For good loans, amounting to a total size of $183,514,490, the total net return is $40,276,694, representing a 21.9% return rate. For bad loans, sized at $147,109,890, the total net loss is $106,944,809, representing a 72.7% loss. Considering all finished loans, Prosper as a whole generates a negative 20.2% return for investors. Therefore, the cost matrix should be as below.

Cost Matrix (Predicted / Actual Loan Status)
          0        1
0         1       -1
1        -4        0

However, this is only the result of looking purely at finished loans. I realized that this result might exaggerate the portion of bad loans, because bad loans tend to default in the early months of their payment term. The distribution of default months below supports my hypothesis.

[Figure: Default month vs. Term (y-axis: default month; x-axis: term of 12, 36 or 60 months)]

Loans with a 12-month term defaulted no later than the 11th month, most likely before 6 months. Bad loans with terms of 36 or 60 months tended to default no later than the 30th month, and 75% of them defaulted before 15 months. Therefore, a fairer way to compute the cost matrix is to also take into account those "unfinished" loans that have successfully paid more installments than the average default month. For instance, I treat 12-month loans that are past their 11th month as good loans. The same logic is applied to 36- and 60-month loans.
After incorporating unfinished loans that I consider highly likely to be good loans, I obtained a new cost matrix. All good "finished" loans generate a return of $88,705,490 out of $297,814,413, representing a 29.8% ROI. For bad loans, the total loss sums up to $105,208,364 out of $147,109,890, representing a loss rate of 72.5%. Prosper as a whole generates a negative 3.7% return for investors. Therefore, the profit matrix for identifying good or bad loans is listed below. The purpose of the following model building is to find a model that optimizes the overall profit.

Profit Matrix (Actual / Predicted Loan Status)
          0        1
0         1       -1
1      -2.4        0

3.5.2 Model Building

With the cost matrix and a pruned dataset in hand, I ran different data models in order to find an optimal way to distinguish bad loans from good ones and thereby minimize the loss for investors. This section starts with logistic regression and explores other models later on. After the selection in the previous section, there are 12 independent variables left in our model, including Term, IsBorrowerHomeowner, FICOScore, TotalCreditLinespast7years, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, CurrentlyInGroup, BankcardUtilization, DebtToIncomeRatio, IncomeVerifiable and LoanOriginalAmount.

Before building the model, I first select variables based on different criteria, including highest R-square, minimum AIC and minimum BIC. I did not use a P-value threshold, since I hold the position that subjectively choosing a P-value threshold to enter or remove variables from the model is unreliable.
1) In the R-square-oriented selection, "TotalCreditLinespast7years" is suggested to be removed from the variable combination.
2) With the minimum AIC methodology, I got the same result as in step 1: remove "TotalCreditLinespast7years".
3) With the minimum BIC methodology, besides the variable removed in the former two steps, "OpenRevolvingMonthlyPayment" was also removed from the combination.
All selection details are listed below.
Models will be built using the different variable combinations.

Variable selection ([X] = entered; Sig Prob is the same under all three criteria)
Parameter                      Sig Prob     Highest RSquare   Minimum AIC   Minimum BIC
Intercept[1]                   1            [X]               [X]           [X]
IsBorrowerHomeowner{1-0}       4.70E-17     [X]               [X]           [X]
FICO Score                     0            [X]               [X]           [X]
TotalCreditLinespast7years     0.16464
OpenRevolvingAccounts          7.11E-06     [X]               [X]           [X]
OpenRevolvingMonthlyPayment    0.09087      [X]               [X]
BankcardUtilization            7.30E-20     [X]               [X]           [X]
DebtToIncomeRatio              2.10E-56     [X]               [X]           [X]
IncomeVerifiable{1-0}          3.20E-93     [X]               [X]           [X]
LoanOriginalAmount             4.00E-107    [X]               [X]           [X]
Term                           1.00E-111    [X]               [X]           [X]

I. I first use the variable combination selected by both the R-square and the minimum AIC methodologies. Variable estimates and the confusion matrix are listed below. This model generates an RSquare of 0.1291 for the validation dataset. The overall error rate is 33.7%, with a breakdown of a 27.5% error rate for good loans and a 46.5% error rate for bad loans. Applying the profit matrix to maximize the overall profit, I found a cutoff value of 35% for the predicted probability (if the predicted probability is higher than 35%, the model identifies the loan as bad, otherwise as good).
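The cutoff search described above can be sketched as follows. This is a minimal illustration, not the procedure actually run for the thesis: the function names and the toy data are hypothetical, and the payoffs encode one assumed reading of the profit matrix (funding a good loan earns 1, funding a bad loan costs 2.4, rejecting a good loan costs 1, rejecting a bad loan costs nothing).

```python
from typing import Dict, Sequence, Tuple

# Payoff per loan keyed by (actual, predicted); 0 = good loan, 1 = bad loan.
# Assumed reading of the profit matrix, see the lead-in above.
PROFIT: Dict[Tuple[int, int], float] = {
    (0, 0): 1.0,   # good loan, funded
    (0, 1): -1.0,  # good loan, rejected (forgone return)
    (1, 0): -2.4,  # bad loan, funded
    (1, 1): 0.0,   # bad loan, rejected
}


def total_profit(
    probs: Sequence[float], actual: Sequence[int], cutoff: float
) -> float:
    """Sum the payoffs when loans with P(default) > cutoff are flagged as bad."""
    total = 0.0
    for p, y in zip(probs, actual):
        predicted = 1 if p > cutoff else 0
        total += PROFIT[(y, predicted)]
    return total


def best_cutoff(
    probs: Sequence[float], actual: Sequence[int], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return the candidate cutoff with the highest total profit, and that profit."""
    best = max(candidates, key=lambda c: total_profit(probs, actual, c))
    return best, total_profit(probs, actual, best)


if __name__ == "__main__":
    # Toy validation set: predicted default probabilities and true labels.
    probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
    actual = [0, 0, 0, 1, 0, 1, 1]
    print(best_cutoff(probs, actual, [0.15, 0.25, 0.35, 0.45, 0.55]))
```

Because the search maximizes profit rather than accuracy, the selected cutoff depends on the payoff values: a larger penalty for funding a bad loan pushes the cutoff lower.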
Term                           Estimate     Std Error
Intercept                      -4.85106     0.13086
IsBorrowerHomeowner[0]          0.09597     0.01155
FICO Score                      0.00982     0.00018
OpenRevolvingAccounts           0.01642     0.00307
OpenRevolvingMonthlyPayment    -0.00006     0.00004
BankcardUtilization             0.29713     0.03254
DebtToIncomeRatio              -0.97204     0.06161
IncomeVerifiable[0]            -0.43483     0.02111
LoanOriginalAmount             -0.00005     0.00000
Term                           -0.02926     0.00131

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0      9595     2970
1      3634     3417

II. Next, I further removed "OpenRevolvingMonthlyPayment", as indicated by the minimum BIC selection. This combination generates an RSquare of 0.1363, slightly lower than that of the former. Variable estimates and the confusion matrix are listed below. The overall error rate of this model is 33.7%, with a breakdown of a 27.5% error rate for good loans and a 46.5% error rate for bad loans. The model achieves its best performance at a cutoff value of 35%.

Term                           Estimate     Std Error
Intercept                      -4.84518     0.13078
IsBorrowerHomeowner[0]          0.09907     0.01139
FICO Score                      0.00982     0.00018
OpenRevolvingAccounts           0.01377     0.00261
BankcardUtilization             0.28383     0.03150
DebtToIncomeRatio              -0.98470     0.06114
IncomeVerifiable[0]            -0.43614     0.02110
LoanOriginalAmount             -0.00005     0.00000
Term                           -0.02920     0.00130

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0      9593     2968
1      3636     3419

Removing the credit lines and revolving monthly payment variables doesn't improve the accuracy of the model. However, it is more convenient to predict the loan status using as few variables as possible. The formula for predicting the probability that a loan goes into default is given below.

\[ P(\text{default}) = \frac{1}{1 + e^{-\left(-4.8 + \sum_i \beta_i X_i\right)}} \]

\beta_i: estimate of predicting variable i
X_i: value of predicting variable i
i: identifier of the predicting variable

3.5.3 Model interpretation

Based on the confusion matrix obtained in the previous section, the model correctly predicts 72.5% of good loans and 53.5% of bad loans.
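These rates can be reproduced from the second confusion matrix if each column is read as the actual loan status. The sketch below makes that reading explicit; the function name is hypothetical and the column interpretation is an assumption that happens to match the percentages quoted in the text.

```python
from typing import Dict, List


def column_rates(matrix: List[List[int]]) -> Dict[str, float]:
    """Compute hit and error rates from a 2x2 confusion matrix.

    The matrix is given row by row as printed. Columns are assumed to hold
    the actual status (0 = good, 1 = bad), so the diagonal cells are the
    correctly classified loans.
    """
    (a, b), (c, d) = matrix
    good_total = a + c  # all loans in column 0
    bad_total = b + d   # all loans in column 1
    return {
        "good_correct": a / good_total,
        "bad_correct": d / bad_total,
        "overall_error": (b + c) / (good_total + bad_total),
    }


if __name__ == "__main__":
    rates = column_rates([[9593, 2968], [3636, 3419]])
    for name, value in rates.items():
        print(f"{name}: {value:.1%}")
```

Under the same reading, the complements of the two hit rates give the 27.5% and 46.5% error rates reported for good and bad loans.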
Applying this matrix to the real P/L statistics of Prosper, this model can improve the overall P/L of Prosper from negative $16,502,874 to a positive return of $15,435,037, representing an ROI of 5.4%. This result implies that investors can gain a positive return if they fully diversify their portfolio. Several features derived from the model are listed below. I interpret the variables in sequence from "unexpected" to "expected" results.
1) Unexpectedly, FICOScore has a positive contribution to the default probability of the loan, denoting that a higher credit rating score can't guarantee a better performance of the loan. This finding is surprising because the result from a third-party credit rating agency should normally be a good indicator of loan quality. To check that this finding is valid, I also looked at the distribution of FICOScore for good and bad loans; the box plot is shown below. Good loans have a higher FICOScore than bad loans at the 25th percentile, the mean, the 75th percentile and the maximum. However, there are outliers for both good and bad loans that might interfere with the accuracy of FICOScore. Unfortunately, even after removing all outliers in terms of FICOScore, it still contributes positively to the default probability.

[Figure: FICO Score vs. LoanStatus (box plot; y-axis: FICO Score, x-axis: LoanStatus)]

Also, as part of the robustness check, I ran the covariance report among variables. The detailed report is listed below.
2) Owning real estate is not a sign of good credit quality.
3) A higher "DebtToIncomeRatio" will reduce the default risk of loans.
4) "LoanOriginalAmount" has a negative impact on the default risk.
5) In line with expectations, open revolving accounts and bankcard utilization have a positive impact on the default rate.
The more revolving accounts the borrower has, or the higher the bankcard utilization ratio, the lower the credit quality and the higher the default probability. This is because the main purpose for which borrowers apply for loans on Prosper is credit consolidation, which means they borrow money from Prosper to pay off the balance due on their revolving credit lines.
6) IncomeVerifiable. Although it's unclear how Prosper verifies personal income, credibility increases if the borrower's income level can be verified.

3.5.4 Robustness Check

A robustness check is conducted in this section. As a start, covariance is assessed among variables to ensure a high degree of independence for each variable. A detailed report is attached as appendix Exhibit I. As shown in the table, there is no obvious correlation among the independent variables. Furthermore, I ran several other models to compare the performance of different models and the reliability of the assumptions.
1) Linear Regression Model. I first tried a linear regression model using the profit/loss amount as the target variable. The intention is that, once the model is built, investors can choose a reasonable cutoff value based on their level of risk aversion. Since the procedure for building the model is similar to that of logistic regression, details are not provided here. The RSquare is as low as 0.04 for the validation dataset, so I won't interpret the parameter estimates. The result is listed below.

Entered  Parameter                      Estimate
[X]      Intercept                      -6923.87
         IsBorrowerHomeowner            0
[X]      FICO Score                     8.802476
         TotalCreditLinespast7years     0
         OpenRevolvingAccounts          0
         OpenRevolvingMonthlyPayment    0
[X]      BankcardUtilization            724.0901
[X]      DebtToIncomeRatio              -1276.9
[X]      IncomeVerifiable               1588.358
[X]      LoanOriginalAmount             -0.17626
[X]      Investors                      2.251251

2) Next, I used the partition method to classify new loans into groups.
The model outperforms the logistic regression model in terms of RSquare, but it generates a lower net profit for Prosper as a whole. The confusion matrix is listed below. The net profit turns out to be $4,294,094, representing a 1.2% ROI for investors.

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0     12025     4388
1      1597     1875

3) As a method for classifying listings, a neural network normally outperforms other models in terms of RSquare and prediction accuracy. However, a major disadvantage of this model is its lack of transparency. Its internal logic is hidden from users, so users can't really uncover the importance of the determinants. The neural network offers the highest RSquare so far, 0.1439. However, the overall net profit that the model generates is negative.

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0     12239     4904
1       954     1359

4) I also applied the discriminant classification method to predict the loan status. The confusion matrix is listed below. The overall error rate of this model is 36.8%, and it generates a positive profit of $15,155,394, representing a 6% ROI. However, one point worth noticing is that the model achieves this performance by giving up over 45% of the total revenue of Prosper, making it unacceptable.

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0     28721     8250
1     15704    12432

5) Last but not least, I also tested the linearity between the independent variables and the target prediction. So far all the models I used assume a linear relationship between the predicting and target variables, while the real situation might differ from that assumption. To test that assumption, I searched for the best nonlinear relationship between the target prediction and each individual predicting variable in order to settle on the nonlinear form. For instance, for FICOScore, logistic 3P, 4P and 5P, polynomials from quadratic to quintic, and other exponential relations were tested to find the best-fitting model. Eventually, a quintic FICOScore gives the best RSquare.
I first used net P/L as the target and tested the optimal form of the relationship between the target and each independent variable. A table with the detailed results is listed below. As can be seen, the R-Square using each individual variable is quite low; I just wanted to get a sense of which nonlinear model I should use when combining all the independent variables.

Variable Name                  Dimension           R-Square
FICOScore                      Cubic               0.0046
Bankcard                       Exponential 4P      0.0017
Debttoincome                   Logistic 4P         0.007
No. of credit lines            Logistic 4P         0.00011
OpenRevolvingMonthlyPayment    Quintic             0.0013
LoanOriginalAmount             Biexponential 5P    0.0248

After revising the predicting variables based on the table above, I ran the regression to test whether the model improves after applying the nonlinear relationships. However, the performance of the numeric regression is still disappointing at this point: the model generates an R-Square as low as 0.04. Using the model built in the previous step, I further ran the logistic regression model to predict the loan status. This time, the model generates an R-Square of 0.1125 and the confusion matrix below, under a cutoff value of 0.35 for the default probability. This model, with a nonlinear assumption between predictors and target, offers a better performance. Given the limits of my skill set in machine learning, I won't proceed to explore more complicated nonlinear models.

Confusion Matrix (Actual / Predicted Loan Status)
          0        1
0      9715     2968
1      3514     3419

As a hint for further research on P2P lending credit modeling, effects that can widen the gap between good loans and bad loans should be amplified. In Prosper's dataset, I plotted the distribution of each predicting variable versus loan status, and found that the distributions are almost identical for bad and good loans, except for FICOScore, StatedMonthlyIncome and LoanOriginalAmount. Graphs portraying the distributions for StatedMonthlyIncome and LoanOriginalAmount are listed below.
As stated earlier in this paragraph, a more accurate model needs to amplify the differences in these three distinguishing variables as much as possible.

[Figure: StatedMonthlyIncome vs. LoanStatus]

[Figure: LoanOriginalAmount vs. LoanStatus]

3.6 Comparison of Findings in Model Building for Lending Club and Prosper

This section compares the findings from Lending Club and Prosper in terms of similarities and differences, and tries to interpret the rationale behind them. As a conclusion, this section also synthesizes lessons for China's P2P operators and regulators to further scrutinize and promote the healthy development of the P2P lending industry.

3.6.1 Similarities

1) A major similarity that jumps out is the negative return generated by Lending Club and Prosper, a return quite different from what is claimed through official channels. If I only include finished or defaulted loans, the ROI figure would be even worse.
2) The performance of Lending Club and Prosper is improving over time, indicating a learning process within P2P lending itself.
3) There are only subtle differences between bad loans and good loans in terms of the predictor parameters. In order to achieve high accuracy, models need to incorporate more parameters and more complicated structures.
4) The profit/cost matrices derived for Lending Club and Prosper are nearly identical, indicating a common phenomenon in the U.S. P2P lending industry that does not differ hugely across platforms. In terms of identifying good and bad loans, the cost ratio is 1:2.6 for Lending Club and 1:2.4 for Prosper.
5) All predicting parameters shared by Lending Club and Prosper affect the default rate in the same direction; for instance, "debt-to-income ratio" has a negative impact on default risk on both Lending Club and Prosper.
6) A data model with nonlinearly structured predicting parameters outperforms models with a linear assumption.
7) The optimal error rates for both Lending Club and Prosper are close to 30%, and the models do a better job of identifying good loans than bad loans.
8) In order to improve the overall ROI for investors, both Lending Club and Prosper need to give up more than 30% of their loan volume.

3.6.2 Differences

1) Although both platforms generate negative returns for investors, Prosper outperforms Lending Club thanks to more conservative strategies, including fewer term options and a stricter credit screening process. (I tried to apply for a loan as a student on both Prosper and Lending Club. Prosper rejected me right away for credit quality reasons, while Lending Club offered me the listing, though at a higher interest rate.)
2) I might have exaggerated Lending Club's loss, because no data is disclosed regarding collections on defaulted loans. I expect that a certain portion of the principal could be recovered after Lending Club outsourced the collection process to collection agencies. Besides, Lending Club has no loan status category such as "past due over 31 days, less than 60 days" like Prosper has. Therefore, treating all loans past due over 31 days as defaults might also exaggerate the scale of default loans.
3) Lending Club only discloses the last payment date, without the detailed amounts showing how much money borrowers have paid. This also affects the accuracy of the net profit/loss of loans on Lending Club.
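The return figures behind the Prosper cost ratio compared in this section can be checked with a few lines of arithmetic. The sketch below uses the totals for finished loans reported in 3.5.1 and the rounded percentages quoted for the adjusted profit matrix; the function name is hypothetical and the code is only a consistency check, not part of the original analysis.

```python
from typing import Tuple


def return_rates(
    good_return: float,
    good_principal: float,
    bad_loss: float,
    bad_principal: float,
) -> Tuple[float, float, float]:
    """Return (ROI on good loans, loss rate on bad loans, overall return)."""
    good_roi = good_return / good_principal
    bad_loss_rate = bad_loss / bad_principal
    overall = (good_return - bad_loss) / (good_principal + bad_principal)
    return good_roi, bad_loss_rate, overall


if __name__ == "__main__":
    # Totals for finished Prosper loans as reported in section 3.5.1.
    good_roi, bad_loss_rate, overall = return_rates(
        good_return=40_276_694,
        good_principal=183_514_490,
        bad_loss=106_944_809,
        bad_principal=147_109_890,
    )
    print(f"ROI on good loans:      {good_roi:.1%}")
    print(f"Loss rate on bad loans: {bad_loss_rate:.1%}")
    print(f"Overall return:         {overall:.1%}")

    # Cost ratio after adding likely-good unfinished loans, from the rounded
    # rates quoted in the text (29.8% ROI, 72.5% loss rate).
    print(f"Cost ratio: 1:{0.725 / 0.298:.1f}")
```

The ratio of the loss rate on bad loans to the ROI on good loans is what the 1:2.4 figure for Prosper expresses: one funded bad loan wipes out the return of roughly 2.4 good loans of the same size.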
3.6.3 Lessons for China's P2P Lending

A big concern of mine regarding China's P2P lending is the over-optimism about default rates on P2P lending platforms. If you look at major official websites such as Lufax, CreditEase or Hongli Capital, you will find that the declared default rate is less than 1%. In a more mature market like the U.S., the default rate for P2P lending is still higher than 20% if all completed loans are included, and as high as 40% if only "finished" loans are considered. It's hard to imagine that a P2P lending market comprised of shadow banks, less trustworthy operators and immature credit systems carries a default rate as low as 1%. A reasonable guess is that there is a huge risk hidden under the water.

The reason P2P lending platforms in China can keep this default risk under water is their unique business model. Operators in China can collect money from a continuous stream of new wealth management products and use that money to pay the promised returns to earlier investors. This fire-fighting model is not sustainable. At the time of writing, there is a rumor that Lufax is facing a huge default risk due to accumulating default loans on the platform.

Counter-intuitively, a higher credit rating score is not a good indicator of higher credit quality on P2P lending platforms. The public, especially active participants who hope that third-party credit agencies could be a life-saver for P2P lending in China, should be skeptical about the real impact of credit ratings.

Besides, China's P2P lending operators are well known for their non-transparent structure. Authorities in China need to push more stringent rules to regulate the disclosure of current portfolios to investors. In addition, in order to mitigate the risk for investors, operators need to control the scale of the overall loan amount on the platform and scrutinize more information about borrowers.
Another takeaway for China's P2P lending is that the potential risk of P2P lending is quite high. Before making any investment decision in P2P lending, one should first assess his or her risk aversion. Like the cutoff value I used to determine whether a loan is good or bad, the judgment call or risk-bearing level of different investors can vary significantly. One should also set a cutoff value for his or her loss level in the long run.

4. Conclusion

4.1. Conclusion of this paper

The IPO of Lending Club and the multiples that investors placed on its stock signaled to the market that P2P lending could take off like a rocket. This is another case where Web 2.0 revolutionizes the traditional banking industry. It has been only 10 years since the concept of P2P lending was established in the UK, and the whole industry is still in its growth stage. Until now, the P2P lending industry is still immature in terms of a full ecosystem, a complete legislative environment and market rules. There are more and more stakeholders participating in P2P lending, such as payment collection agencies, data filtering and analysis providers, and news channel operators specifically serving the P2P lending industry. A more comprehensive ecosystem needs to be built that covers the complete market dynamics and imposes strict requirements on major stakeholders.

As an emerging market for P2P lending, China first imitated the business models of western countries and then developed further according to its own business environment. Compared to the U.S., China has a longer way to go in building its credit system, regulations and transparent business models. From analyzing the data from Lending Club and Prosper, a big concern I have about China's P2P lending market is its artificially low default rate. I believe that a huge default risk is still hidden under the water for China's P2P lending industry. Three conclusions can be drawn from this paper.
First, the actual ROI that P2P lending platforms generated for investors is much lower than what is claimed through official channels, and is even negative. Considering only matured loans, the ROI that Prosper and Lending Club offer to investors is negative 20.2% and negative 13.8% respectively. This conclusion can serve as a reference when considering the real default rate of P2P loans in China's market. Besides, third-party credit ratings in the U.S. are not as good an indicator of potential risk as expected, and other parameters also have an unexpected impact on the default rate. Second, a higher credit rating score, a lower debt-to-income ratio or a lower bankcard utilization rate doesn't lead to a lower default rate. Therefore, investors should not rely purely on those ratios in isolation when making investment decisions. Last but not least, investors can use the formula computed in chapter 5 to identify good and bad loans, and diversify their portfolio as much as possible to avoid adverse selection.

4.2. Further research proposed

Further research on the principal collection process after default is worth looking into. We should identify the responsibilities and consequences for both borrowers and platforms once borrowers default on P2P lending platforms. Continuous research needs to be done in a dynamic way to capture major changes in the P2P lending market. In addition, data mining with machine learning methodologies to explore the nonlinear structure between predictor parameters and the target prediction would be beneficial for both investors and platforms. This paper doesn't cover the cooperation model between P2P lending and other financial institutions, such as hedge funds and regional banks. Comparing the features of loans funded by institutional investors and by individual investors will help unveil the logic of how to better identify default risk and prevent adverse selection.
Retrieved from http://www.istor.org/stable/259373 Petersen, M. A. (2004). Information: Hard and soft. Northwestern University, Chicago IL. Evanston, IL: Citeseer. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1 26.8246&amp;rep=rep1 &amp;type=pdf Phelps, E. S. (1972). The Statistical Theory of Racism and Sexism. American Economic Review, 62(4), 659-661. Pope, D. G., & Sydnor, J. R. (2008). What's in a Picture? Evidence of Discrimination from Prosper. com. Journal of Human Resources. Philadelphia, PA. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:What?s+in+a+Picture?+Evidenc e+of+Di scrimination+from+Prosner#O 64 Ravina, E. (2007). Beauty, Personal Characteristics, and Trust in Credit Markets. papers.ssrn.com. New York, NY. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstractid=972801 Rumiany, D. (2007). Internet Bidding for Microcredit: making it work in the developed world, conceiving it for the developing world. Development Gateway, March. Retrieved from http://topics.developmentgateway.org/uploads/media/ict/Internet Bidding for Microcredit.pdf Theseira, W. (2008). Competition to Default? Racial Discrimination in the Market for Online Peer-to-Peer Lending. Philadelphia, PA. Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), 13-23. Citeseer. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi= 10.1 1.104.6570 Paul Slattery (2013), Square Pegs in a Round Hole: SEC Regulation of Online Peer-to-Peer Lending and CFPB Alternative. Retrieved from https://www.copyright.com/ccc/basicSearch.do? &operation=go&searchType=0 &lastSearch=simple&all=on&titleOrStdNo=0741-9457 Xubo Wang, Defu Zhang, Xiangxiang Zeng & Xiaoying Wu, A bayesian Investment Model for Online P2P Lending. J. Su et al. (Eds.): ICoC 2013, CCIS 401, pp. 21-30, 2013. 
Binjie Luo & Zhangxi Lin (2014), A decision tree model for herd behavior and empirical evidence from the online P2P lending market. Inf Syst E-Bus Manage (2013) 11:141-160 DOI 10.1007/s10257-011-0182-4 Radha Vedala & Bandaru Rakesh Kumar (2014). An Application of Naive Bayes Classification for Credit Scoring in E-Lending Platform Ruiqiong Gao & Junwen Feng (2014), An Overview Study on P2P Lending. International Business and Management Vol. 8, No. 2, 2014, pp. 14-18 DOI:10.3968/4801 65 Xue Rui, Bingwu Liu & Shaohua Tan (2012), Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending. Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Zhensheng Zhang (2014), CREDIT RISK PREFERENCE IN E-FINANCE: AN EMPIRICAL ANALYSIS OF P2P LENDING. PACIS 2014 Proceedings. Paper 197. http://aisel.aisnet.org/pacis2014/197 Gwangjae Jeong, Eunkyoung Lee & Byungtae Lee (2012), Does Borrowers' Information Renewal Change Lenders' Decision in P2P Lending? An Empirical Investigation. ICEC '12, August 07 - 08 2012, Singapore, Singapore concludes with a summary and the limitations of the study. Copyright 2012 ACM 978-1-4503-1197-7/12/08 Jianxian Qiu, Zhangxi Lin & Binjie Luo (2012), Effects of Borrower-Defined Conditions in the Online Peer-to-Peer Lending Market. M.J. Shaw, D. Zhang, and W.T. Yue (Eds.): WEB 2011, LNBIP 108, pp. 167-179, 2012. Mingfeng Lin, Nagpurnanand R. Prabhala, Siva Viswanathan, (2013) Judging Borrowers by the Company They Keep: Friendship Networks and Information Asymmetry in Online Peer-to-Peer Lending. Management Science 59(1):17-35. http:// dx.doi.org/l 0.1287/mnsc. 1120.1560 Chen, Dongyu; Hao, Lou; and Xu, Hong, "Gender Discrimination towards Borrowers in Online P2PLending" (2013).WHICEB 2013 Proceedings. Paper 55. 
http://aisel.aisnet.org/whiceb20l 3/55 Hongke Zhao, Le Wu, Qi Liu, Yong Ge & Enhong Chen (2014), Investment Recommendation in P2P Lending: A Portfolio Perspective with Risk Management Jiayu Wu (2014), Loan Default Prediction Using Lending Club Data 66 Current and Complete, matured loan. Cost matrix. Krystyna Mitrega-Niestroj (2013), RECENT DEVELOPMENTS OF THE P2P LENDING MARKET IN POLAND Eric C. Chaffee & Geoffrey C Rapp (2012), Regulating Online Peer-to-Peer Lending in the Aftermath & of Dodd-Frank: In Search of an Evolving Regulatory Regime for an Evolving Industry. 69 Wash. Lee L. Rev. 485 2012 Paul Slattery (2013), Square Pegs in a Round Hole: SEC Regulation of Online Peer-to-Peer Lending and the CFPB Alternative 30 Yale J. on Reg. 233 2013 Ying Wang & Zhangxi Lin (2014), The Importance of Objective and Dynamic Credit Evaluation in P2P Lending Market. Seth Freedman & Ginger Zhe Jin (2014), THE INFORMATION VALUE OF ONLINE SOCIAL NETWORKS: ESSONS FROM PEER-TO-PEER LENDING, Working Paper 19820 http://www.nber.org/papers/w 19820 Hossein Ghasemkhani, Yong Tan & Arvind K. Tripathi (2013), The Invisible Value of Information Systems: reputation Building in an Online P2P Lending System Laura Gonzalez & Yuliya Komarova Loureiro (2014), When can a photo increase credit? The impact of lender and borrower profiles on online peer-to-peer loans 67