Analysis and Assessment of Credit rating model in P2P lending
An instrument to solve information asymmetry between lenders and borrowers
By
Yang Yang
B.Sc. Management of Science and Project
University of Science and Technology of China, 2007
SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR DEGREE OF
MASTER OF SCIENCE IN MANAGEMENT STUDIES
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUNE 2015
© 2015 Yang Yang. All rights reserved.
The author hereby grants to MIT permission to reproduce
and to distribute publicly paper and electronic
copies of this thesis document in whole or in part
in any medium now known or hereafter created.
Signature of Author: Signature redacted
MIT Sloan School of Management
May 8, 2015
Certified by: Signature redacted
Christian Catalini
Assistant Professor of Technological Innovation,
Entrepreneurship, and Strategic Management
Thesis Supervisor
Accepted by: Signature redacted
Michael A. Cusumano
SMR Distinguished Professor of Management
Program Director, M.S. in Management Studies Program
MIT Sloan School of Management
Analysis and Assessment of Credit rating model in P2P lending
An instrument to solve information asymmetry between lenders and borrowers
By
Yang Yang
Submitted to MIT Sloan School of Management
on May 8, 2015 in Partial Fulfillment of the
Requirements for the Degree of Master of Science in
Management Studies.
ABSTRACT
Since the establishment of the first P2P lending platform in 2005, the P2P lending industry has
been nibbling at the market share of traditional consumer credit. In 2014, Lending Club and
Prosper originated over 7 billion USD in personal loans. By comparison, Citi, one of the biggest
traditional banks in the U.S., issued 25.2 billion USD in 2014. Given the advantages of P2P
lending over traditional banks, the market for P2P lending is expected to grow rapidly along
with improvements in the internal systems of P2P lending platforms, external regulation and
greater participation from borrowers and lenders. Although most P2P lending platforms in
China first imitated the business models of U.S. or European platforms, they have
progressively evolved to incorporate different business models for legislative, economic
or behavioral reasons.
Several findings emerge from analyzing the data from Lending Club and Prosper. First,
although both platforms have improved their default rates each year, both currently offer
negative returns for investors. Second, when only finished/matured loans are considered, a
higher credit score does not lead to lower default risk. Third, on average, a defaulted loan
causes a loss more than twice as large as the interest return offered to investors. Taking this
cost matrix into consideration, the optimal model is not necessarily the one with the highest
accuracy but the one with the maximum return. Fourth, the ex post return offered by the
platforms is not enough to cover the potential risk facing investors.
Thesis Supervisor: Christian Catalini
Title: Assistant Professor of Technological Innovation, Entrepreneurship, and Strategic
Management
PURPOSES OF THIS PAPER
It has been almost 10 years since the first P2P lending platform was founded in the UK. While
P2P lending has grown rapidly over those 10 years, it is still in its infancy compared to the
traditional banking industry. More than 70 academic papers on P2P lending were published
between 2008 and 2015, approaching the topic from different perspectives: the determinants
of a loan being successfully funded by investors, regulation, credit risk, determinants of
credit quality and default probability, business models of P2P lending across countries,
internal information systems and literature reviews.
Even though a handful of papers studied credit risk using data mining methodologies, most of
them focused on explaining the determinants of a loan being successfully funded. Few studies
considered a cost matrix in the model or compared results from Prosper and Lending Club.
P2P lending is a two-sided market. To further boost market growth, P2P lending platforms
also need to enhance investors' ability to assess credit risk. By doing so, platforms can
offer higher returns and thus attract more investors to lending activity.
The main purpose of this paper is to identify the key determinants of a loan's default
probability and their respective coefficients, and then build the optimal model to predict the
loan's status. This model will serve as a way to mitigate information asymmetry in P2P lending
and gaming by borrowers. In addition, this paper builds on previous literature to review the
current development of P2P lending.
Another motivation for this paper is that the Chinese government has just allowed
non-state-owned companies to participate in the personal credit rating business. The public
believes this move will become a game changer for the internet finance industry,
especially the P2P lending segment. This paper will assess whether a third-party credit rating
will help investors avoid adverse selection.
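The role of the cost matrix mentioned above can be illustrated with a minimal sketch. It is not the model built in this paper: the loan records, the candidate thresholds and the payoff values are hypothetical, and the loss of 2 units per default is an assumption that only mirrors the statement that a default costs more than twice the interest return. The sketch shows that the threshold with the highest accuracy need not be the one with the highest return.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    p_default: float  # hypothetical model-estimated default probability
    defaulted: bool   # observed outcome


# Hypothetical cost matrix, in units of the interest earned on one good loan.
GAIN_GOOD = 1.0      # funded loan that is repaid
LOSS_DEFAULT = -2.0  # funded loan that defaults (assumed: twice the gain)


def total_return(loans, threshold):
    """Fund every loan whose predicted default probability is below threshold."""
    total = 0.0
    for loan in loans:
        if loan.p_default < threshold:
            total += LOSS_DEFAULT if loan.defaulted else GAIN_GOOD
    return total


def accuracy(loans, threshold):
    """Share of loans whose predicted status matches the observed status."""
    correct = sum((loan.p_default >= threshold) == loan.defaulted for loan in loans)
    return correct / len(loans)


def best_threshold(loans, candidates, score):
    """Return the candidate threshold that maximizes the given score."""
    return max(candidates, key=lambda t: score(loans, t))


# Toy data, invented for illustration only.
LOANS = [
    Loan(0.05, False), Loan(0.10, False), Loan(0.20, False),
    Loan(0.30, True), Loan(0.35, False), Loan(0.40, False),
    Loan(0.42, False), Loan(0.45, True), Loan(0.60, True),
    Loan(0.80, True),
]
CANDIDATES = [0.25, 0.50, 0.70]

if __name__ == "__main__":
    print("best by accuracy:", best_threshold(LOANS, CANDIDATES, accuracy))
    print("best by return:  ", best_threshold(LOANS, CANDIDATES, total_return))
```

On this toy data the accuracy-maximizing threshold funds more loans than the return-maximizing one, because each extra default it lets through costs twice what an extra good loan earns.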
Table of Contents
1. INTRODUCTION
1.1 DEFINITION OF P2P LENDING
1.2 HOW P2P LENDING WORKS (LENDING CLUB, PROSPER)
2. MARKET REVIEW OF P2P LENDING
2.1 MARKET SIZE
2.2 KEY PLAYERS AND RESPECTIVE MARKETPLACE
2.3 MARKET OUTLOOK OF P2P LENDING
2.4 BUSINESS MODELS OF P2P LENDING
3. DATA ANALYSIS AND MODELING
3.1 INTRODUCTION
3.2 KEY VARIABLES
3.2.1 Prosper
3.2.2 Lending Club
3.3 DISTRIBUTION OF DATASET
3.3.1 Prosper
3.3.2 Lending Club
3.4 MODEL BUILDING AND INTERPRETATION-LENDING CLUB
3.4.1 Data Preparation
3.4.2 Model Building
3.4.3 Model Interpretation
3.4.4 Robustness Check
3.5 MODEL BUILDING AND INTERPRETATION-PROSPER
3.5.1 Data Preparation
3.5.2 Model Building
3.5.3 Model Interpretation
3.5.4 Robustness Check
3.6 COMPARISON OF FINDINGS IN MODEL BUILDING FOR LENDING CLUB AND PROSPER
3.6.1 Similarities
3.6.2 Differences
3.6.3 Lessons for China's P2P Lending
4. CONCLUSION
4.1 CONCLUSION OF THIS PAPER
4.2 FURTHER RESEARCH PROPOSED
5. REFERENCES
1. Introduction
Freedman and Jin (2007) wrote the first academic paper to look into the business of P2P
lending. They raised the question of whether P2P lending would reshape the future of the
financial industry or be a fad that would wane over time. Even though it has been over 6 years
since that paper, it is still too early to answer that question. What we see on the market is
the emergence of more P2P lending platforms globally and the IPO of Lending Club in
December 2014. In addition, the attitude of traditional banks toward this infant industry is
also evolving. For instance, in early 2014, an employee of Wells Fargo told the media that an
internal email had been sent requesting all employees of Wells Fargo not to engage in any P2P
lending business. By contrast, many hedge funds and regional banks are purchasing personal
loan products from P2P lending platforms because of their stable and attractive returns. In
addition, more traditional financial institutions have opened their own P2P platforms to catch
up with the trend.
1.1 Definition of P2P Lending
P2P stands for Peer-to-Peer or Person-to-Person. In P2P lending, platforms act as
intermediaries that match lenders with borrowers and handle the transfer of money. P2P
lending was first introduced by Zopa in the UK in 2005. At the time of writing, Zopa has
originated 713 million GBP in loans and is one of the biggest platforms in the world. The
emergence of P2P lending is also a result of applying Web 2.0 to the financial industry. By
avoiding the overhead costs and infrastructure of traditional banks, P2P lending platforms can
offer lower interest rates to borrowers and accumulate huge traffic within a short period
(Dhand et al., 2008).
1.2 How P2P Lending Works (Lending Club, Prosper)
Borrowers apply for personal loans for various reasons. The main purpose of personal
loans on Lending Club and Prosper is credit consolidation. A borrower applies for a loan by
providing private information such as loan amount, term, credit rating score, debt-to-income
ratio, monthly income, occupation and the loan purpose. The platform then assesses the
information and sets a fixed interest rate for the loan. After the borrower agrees to the
interest rate, the loan is listed on the platform, where investors can browse the loan
information and decide whether to invest and how much.
Among the 73 papers on P2P lending published between 2008 and 2015, 20 discussed how to
increase the probability of a loan being successfully funded and what the key determinants are.
Compared with unverified variables, verified variables play a much more significant role in
investors' decisions to invest in a loan (Gregor et al., 2010). Also, borrowers who are willing
to disclose more information normally pay a lower interest rate (Böhme et al., 2010). Social
ties increase the chances of having the loan fully funded (Sergio, 2009; Greiner & Wang,
2009; Herrero-Lopez, 2009; Hildebrand & Rocholl, 2010; Lin, 2009), reduce the ex post
interest charged on the loan, and also decrease the default risk associated with the loan (Lin et
al., 2009; Zhensheng, 2014). Furthermore, some research focuses on how borrowers'
demographic information, such as appearance and gender, affects loan funding.
Research shows that appearance does influence lenders' decisions to fund a loan
(Jefferson et al., 2012). Female borrowers are less likely to get loans funded than male
borrowers.
Based on all the information provided by the borrower, investors then need to determine
whether to lend and how much to lend. The objective of lending money on P2P platforms is
to gain high return and mitigate default risk. Investors on P2P lending platforms are inclined
to invest in loans with higher ex post return, which also carry higher default risk. Assessing
default risks based on previous loans' performance is another focus of academic papers.
Eight papers built models to investigate the key determinants of default risk, so that
investors can use them as a guideline to avoid adverse selection. Loans with lower
credit grades and longer terms carry higher default risk (Riza et al., 2015). This finding
is the opposite of the result in this paper because, rather than using either completed loans
or matured/finished loans, I used a combination of both. There are discrepancies between the
risk premiums charged and the real default risk associated with loans on P2P lending
platforms (Kumar, 2007). This conclusion is supported by evidence that the premium charged
by P2P platforms is not enough to cover the potential loss of investors (Riza et al., 2015).
It has also been recommended that default risk be mitigated by setting up a social reputation
system on P2P lending platforms (Everett, 2010; Lin, 2009).
Platforms charge borrowers a loan origination fee once the loan is successfully funded.
Investors are also charged a service fee for the management of installment payments from
borrowers. A handful of papers focused on building the internal information systems of
P2P platforms. For instance, Collier (2010) informed practice and theory on developing
community reputation, which can reduce information asymmetry on Prosper and mitigate
adverse selection. Also, as intermediaries in the financial market, platforms are regulated by
both the SEC and the CFPB. Four papers examined the current regulations on P2P lending and
drew implications for the further development of regulation specific to P2P lending. A
multi-agency regulatory approach to P2P lending should be implemented that imitates the
approach applied to regulate traditional lending (Eric et al., 2012).
Borrowers make monthly installment payments until the loans reach maturity. If
desired, they can also choose to repay all remaining principal ahead of the loan's maturity by
paying a service fee. Platforms also provide a trading system for investors who want to sell
the loans they hold at a discount. This trading system, like an open market, helps
platforms provide more flexibility to investors.
However, some loans default in the early stages of installment payments. This causes a huge
loss for investors as a whole. Investors are inclined not to hire an agency to recover the net
principal loss because the amount invested is small (Freedman & Jin, 2008). Further research
into after-default management in P2P lending is urgently needed because it can help mitigate
investors' net principal loss and improve the risk-adjusted return of platforms as a whole.
2. Market Review of P2P Lending
2.1 Market Size
The potential market size of P2P lending can be measured in both a micro and a macro way.
In the micro view, the market for P2P lending is mainly the market for unsecured loans,
including unsecured personal loans and lines of credit. The total amount of consumer credit in
the U.S. as of October 2014 was 3.283 trillion USD, according to the Federal Reserve G.19
release. Per the E2 Release of the Federal Reserve, the total amount of outstanding business
loans ranging from $10,000 to $99,000 is 3.4 billion USD. Summing these two components
gives a potential market size for P2P lending of 3.286 trillion USD in the U.S. market alone.
Currently, Prosper contributes 2 billion USD in loans, and Lending Club contributes 6 billion
USD.
In the macro view, we can expand the market to mid-size business loans, since
Lending Club also provides business loans up to 300K USD. The total amount of business
loans ranging from 100K to 999K is 12 billion USD (Donghon, 2014). Conservatively, we can
add another 2.4 billion USD to the potential P2P lending market. This results in a market with
a total size of 3.288 trillion USD. Investors on P2P lending platforms are poised to take
between 25 percent and 30 percent of the business that traditional banks are doing. The
overall market for P2P lending would then grow to about $1 trillion by 2025 (Cromwell, 2015).
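The market-size arithmetic in this section can be checked with a short sketch. The input figures are the ones quoted in the text; the 20% share applied to mid-size business loans is an assumption inferred from the 2.4 billion out of 12 billion USD, and the function names are hypothetical.

```python
# Figures quoted in the text, in billions of USD.
CONSUMER_CREDIT = 3283.0     # Federal Reserve G.19, October 2014
SMALL_BUSINESS_LOANS = 3.4   # business loans of $10,000 to $99,000
MID_BUSINESS_LOANS = 12.0    # business loans of 100K to 999K
MID_SHARE = 0.2              # assumed share: 2.4 of 12 billion


def potential_market():
    """Return the (micro, macro) potential market size in billions of USD."""
    micro = CONSUMER_CREDIT + SMALL_BUSINESS_LOANS
    macro = micro + MID_BUSINESS_LOANS * MID_SHARE
    return micro, macro


def p2p_share(market, low=0.25, high=0.30):
    """Range of business P2P investors would take at a 25% to 30% share."""
    return market * low, market * high


if __name__ == "__main__":
    micro, macro = potential_market()
    low, high = p2p_share(macro)
    print(f"micro view: {micro:,.1f} billion USD")
    print(f"macro view: {macro:,.1f} billion USD")
    print(f"25%-30% share: {low:,.1f} to {high:,.1f} billion USD")
```

Under these inputs the macro total is roughly 3.29 trillion USD, and a 25% to 30% share of it falls somewhat below 1 trillion USD, the same order of magnitude as the 2025 projection cited above.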
2.2 Key Players and Respective Marketplace
Rank  Lending Site   Year Founded  Loan Volume ($ billion)  Country
1     Lending Club   2007          6                        USA
2     CreditEase     2006          3.2                      China
3     Upstart        2012          3                        USA
4     Prosper        2006          2                        USA
5     Zopa           2005          0.8                      UK
Lending Club. Lending Club, founded in 2007, has paid investors $590 million in interest
returns. According to statistics on Lending Club's website, as of September 30, 2014, 83.17%
of Lending Club borrowers reported using their loans to refinance existing loans or pay off
their credit cards. The breakdown of the main purposes of Lending Club loans is shown below.
[Figure: Breakdown of the main purposes of Lending Club loans]
Prosper. Prosper, founded by Chris Larsen and John Witchel on February 5, 2006, was the
first P2P lending platform in the U.S. It remains unlisted and is financially backed by several
big names in venture capital. To date, Prosper has more than 2 million members and has
originated over 2 billion USD in loans.
Upstart. Upstart was founded by ex-Googlers in 2012 in the U.S. and has originated more than
$3 billion in loans with an annual growth rate of 265%. The major difference between
Upstart and other platforms is that, when assessing the credit quality of borrowers, Upstart
starts with the same information but also includes academic variables in its statistical
risk assessment.
CreditEase. As reported by Peter Renton in 2014, CreditEase is the largest P2P lending
platform in China and has generated more than $3.2 billion USD in loans to over 500,000
borrowers. The company was founded in 2006 and now operates in over 150 cities in
China.
Zopa. Zopa is the oldest peer-to-peer lending company in the world. The company was
founded in 2005 in the UK. It has lent $1 billion USD and has helped both borrowers and
investors get better rates.
2.3 Market Outlook of P2P Lending
The growth of P2P lending has exceeded the public's expectations in recent years. Gartner
(2010) predicted that P2P lending would grow by 66% to a total size of 5 billion USD by the
end of 2013. Looking at the statistics of the biggest platforms, I found that Lending Club
experienced an annual growth rate of over 150% through 2014. Prosper.com has also achieved
exponential growth since its establishment: it had originated over 300 million USD in loans
by the end of 2013 and raised this number to over 1.5 billion USD by the end of 2014.
Although it is extremely difficult to estimate the exact growth rate of P2P lending,
several determinants indicate its future trend from a macro perspective.
1) Geographic expansion. P2P lending is not yet authorized in all states of the U.S. due to
the complexity of state autonomy. Even in China, the acceptance of P2P lending varies among
regions. Further geographic expansion is expected in the next few years.
2) More comprehensive legislation. The main reason that certain public authorities or groups
are still skeptical about P2P lending is that it is still in its infancy and is less regulated
than traditional banks. Regulations specific to P2P lending are urgently needed in the market.
3) Challenges from traditional banking. Although P2P lending has a large cost advantage over
traditional banks, with the recovery of the U.S. economy the government is considering
loosening the requirements for loan borrowers. This will help traditional banks regain
borrowers who were previously not eligible for a loan. In China, many financial institutions
have also introduced their own P2P platforms to gain a piece of the pie.
4) Information asymmetry. Information asymmetry might lead investors to adverse selection
(Akerlof, 1970) and moral hazard (Stiglitz and Weiss, 1981). Platforms are making various
efforts to mitigate it.
5) The economy and employment. The performance of both the economy and employment will
affect the further development of P2P lending. According to statistics from Prosper and
Lending Club, the purpose of most borrowers is credit consolidation. A stronger economy,
higher wages and a higher employment rate mean that people's financial condition will improve
and the need for credit consolidation will decline accordingly.
6) Institutional investors. P2P lending can provide a higher ROI than many other investments
in the financial market. Some institutional investors purchase loan packages from platforms to
gain stable cash flow and returns. A simple comparison among different financial investments
is listed below. In 2013, P2P lending generated much lower returns than the NYSE and Dow
Jones indices, but it outperformed both in 2014. However, for P2P lending platforms, I am
using the official investment return rate, and the true risk-adjusted investment return might
differ from this figure. Another point worth noting is that the superior return of the stock
market in 2013 was due to the recovery from an economic and financial downturn. An ROI of
around 10% is already very impressive in the financial investment sector. As reported by
Bloomberg, the average return of hedge funds was 7.4% in 2013.
Investment     2014     2013
Lending Club   10.50%   8.75%
Prosper        9.79%    9.86%
3yr T          1.10%    0.78%
NYSE           4.22%    23.18%
Dow Jones      7.52%    26.50%
By the end of 2014, the total amount of loans originated through P2P lending in China had
reached $40 billion, with a default rate of 17.46%. A total of 1.16 million borrowers had
their loans funded by 630,000 investors; these numbers increased by 364% and 320%,
respectively, compared with 2013. There are 1,575 P2P lending platforms in China, and 275
went bankrupt in 2014, implying that roughly one out of six platforms was not sound. The
average loan amount is $35,000 USD, and the average amount funded per investor is $64,000
USD. These statistics come from Wangdaizhijia.com in China.
2.4 Business Models of P2P Lending
This section introduces the business models used by major P2P lending platforms in the
U.S. and China and addresses the major differences between the two markets.
In the U.S. market, the business models of P2P lending platforms are quite similar to each
other. Borrowers post their loans on platforms, and investors browse and choose loans to
invest in. The P2P lending platform acts as an intermediary and is responsible for risk rating,
determining the interest rate, document verification and interest payment management.
However, Prosper and Lending Club still differ in several ways, as listed below.
1) Loan type. Prosper only originates personal loans ($2,000-$35,000 USD), while Lending
Club originates personal loans ranging from $1,000 to $35,000 USD and also business loans
up to $300,000 USD. In addition, Prosper and Lending Club provide loans with different
maturities. Both provide 3-year and 5-year loans, and Lending Club provides a
1-year loan as well.
2) Interest rate. P2P platforms determine the interest rate by considering information
reflecting borrowers' credit quality. Both Prosper and Lending Club stipulate the cap and
floor interest rates for loans falling into different credit ratings/grades. However, interest
rates in the same credit category vary between Prosper and Lending Club due to different
credit rating logic.
3) Credit scoring. Prosper and Lending Club each provide a proprietary credit score as a major
indicator of loan risk. Both offer 7 rating categories: Prosper from HR (worst) to AA
(best) and Lending Club from G (worst) to A (best).
4) Origination fee. Platforms earn money by charging fees to borrowers. The cap and floor
fee rates charged by Prosper and Lending Club are the same, but different rates are
charged to borrowers in different risk categories. A simple comparison is listed below,
including credit rating, respective interest rate and origination fee.
Prosper
Rating  Interest Rate    Origination Fee
AA      6.05%-7.96%      1%-2%
A       8.19%-11.33%     4%
B       11.56%-14.06%    5%
C       14.59%-18.27%    5%
D       19%-22.68%       5%
E       23.44%-27.04%    5%
HR      27.75%-31.25%    5%

Lending Club
Rating  Interest Rate    Origination Fee
A       5.49%-8.19%      ?%-3%
B       8.67%-11.99%     4%-5%
C       12.39%-14.99%    5%
D       15.59%-17.86%    5%
E       18.54%-21.99%    5%
F       22.99%-25.57%    5%
G       25.8%-26.06%     5%
5) Affiliate & referral programs. Prosper introduced an affiliate program to attract more
borrowers and lenders through referrers, paying $100-150 USD for borrower leads
and $50 USD for lender leads. Lending Club also introduced an affiliate and referral program,
but detailed bonuses are not provided on its website.
6) Notes trading. Both Prosper and Lending Club provide a notes trading platform, where
investors can trade the notes they hold with each other. Folio is a broker-dealer platform
that only charges sellers a 1% fee.
7) Early repayment. Borrowers can choose to repay the remaining balance early without paying
any penalty, in order to avoid paying monthly interest in the future.
8) Interest auction. P2P lending platforms normally set the interest rate for loans
based on the information provided by the borrowers. However, in its early years, Prosper
ran an interest rate auction in which investors bid the lowest interest rate they
would accept in order to compete to fund the most popular loans. This is why some
loans were originated with a lower interest rate. Prosper stopped the
interest auction service in 2011 and implemented fixed interest rates like Lending Club.
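The interest auction in item 8 can be sketched as a simple reverse auction. This is only an illustration under stated assumptions, not Prosper's actual rules, which the text does not describe: the sketch assumes that the lowest-rate bids are filled first, that the loan closes at the highest accepted bid rate, and that bids above the borrower's maximum rate are ignored. The function name and the bid data are hypothetical.

```python
from typing import List, Optional, Tuple

# (investor, amount offered in USD, lowest acceptable interest rate)
Bid = Tuple[str, float, float]


def run_auction(
    loan_amount: float, max_rate: float, bids: List[Bid]
) -> Tuple[Optional[float], List[Tuple[str, float]]]:
    """Fill the loan with the lowest-rate bids first.

    Returns (closing rate, accepted fills), or (None, []) if the bids
    at or below max_rate cannot fully fund the loan.
    """
    accepted = []
    remaining = loan_amount
    closing_rate = None
    for investor, amount, rate in sorted(bids, key=lambda b: b[2]):
        if remaining <= 0 or rate > max_rate:
            break
        take = min(amount, remaining)
        accepted.append((investor, take))
        remaining -= take
        closing_rate = rate  # assumed: highest accepted bid sets the rate
    if remaining > 0:
        return None, []
    return closing_rate, accepted


# Hypothetical bids for a 10,000 USD loan with an 18% maximum rate.
BIDS = [("a", 4000, 0.15), ("b", 5000, 0.12), ("c", 3000, 0.14), ("d", 6000, 0.17)]

if __name__ == "__main__":
    rate, fills = run_auction(10000, 0.18, BIDS)
    print("closing rate:", rate)
    print("fills:", fills)
```

With these bids the loan closes below the 18% maximum, which is consistent with the observation that auctioned loans were sometimes originated at a lower interest rate.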
In China's market, P2P lending platforms basically follow the same model as those in
the U.S., acting as an intermediary between borrowers and lenders. However, due to
differences in the economic and legal environment, as well as in customer behavior,
P2P lending in China has evolved unique features. We use Hongling Capital and
CreditEase as representatives since they are two of the earliest P2P platforms to
originate in China.
1) Loan type. Hongling Capital offers personal and business loans of between
$500 and $1,600,000 USD, with maturities between 3 months and 12 months. CreditEase
offers personal loans of between $1,600 USD and $1,000,000 USD with
maturities between 1 year and 4 years. P2P lending platforms in China's
market are evidently more aggressive and also bear higher default risk.
2) Interest rate. Rather than determining the interest rate based on credit score, maturity and
amount as P2P platforms in the U.S. do, China's platforms determine interest rates
simply based on loan type or maturity, because there is no credit agency that can provide
a comprehensive credit report for individuals (China's PBOC only authorized certificates
for credit agencies in January 2015). Hongling Capital sets interest rates between 8%
and 18%, and CreditEase between 10% and 12.5%.
3) Credit scoring. The only credit report that a borrower can submit is the one provided by
the PBOC, which includes the history of credit card usage and loan repayment. Unlike U.S.
platforms, platforms in China do not rate borrowers into different credit categories. It is
common practice for platforms to award credits to borrowers and investors who
successfully make their monthly payments or investments. For instance, Hongling
Capital sorts customers into 5 categories from V1 (lowest) to V5 (highest).
Investors on Hongling Capital can refer to these categories as a risk indicator.
4) Origination fee. Creditease charges investors 10% of interest earnings and borrowers 10%
as a service fee. Rates and fees on Hongling Capital are more complex. Hongling Capital
charges investors between 0% and 10% as fees, depending on their category, which ranges
from V1 to V5. For instance, investors in V1 need to pay 10% of interest earnings as a
service fee, while those in V5 don't need to pay any service fee. For borrowers, Hongling
Capital charges a service fee that is a percentage of the loan and varies by loan type. The
overall range is from 3% to 14.6%.
5) Affiliate & Referral Programs. Creditease doesn't pay a referral bonus, while Hongling
Capital pays $6 USD if the referred customer registers as a normal member, and $12 USD if
the customer registers as a VIP.
6) Notes trading. Platforms in China also provide notes trading services to investors.
7) Early repayment. On Creditease, borrowers who want to repay the remaining loan early must
pay, besides the interest for the current month, the remaining loan and the service fee,
a penalty of 0.5% of the remaining loan to the platform. Similarly, borrowers on
Hongling Capital must pay an extra month of interest as a penalty if they want to pay off
the remaining loan early.
8) Principal Guarantee. The biggest difference between the U.S. and China in P2P lending
is that many platforms in China bring in a third-party company to guarantee the safety of
investors' money in case any fraudulent funding happens. This is a remedy for the
lack of credit scores for borrowers and platforms, and it improves the
confidence of investors. However, a third-party guarantee is not a panacea for P2P lending
in China. A guarantee company certificate costs only $1 million USD, and there are
cases where owners disappeared with the money, leaving investors to lose all their money.
3. Data Analysis and Modeling
3.1 Introduction
This section addresses the following questions: 1) What is the distribution of PV, the rate of
bad loans and the interest rate across credit categories? 2) Does the risk-return profile
improve from year to year, especially when platforms change their policies? 3) Do borrowers
and investors behave differently on Prosper and Lending Club? 4) How much do the determinant
variables contribute to the performance of loans? 5) Can a model built with different data
mining methodologies determine the probability of default? 6) According to Riza, Yanbin,
Benjamas and Min (2014), the higher interest rates set by Prosper and Lending Club for
riskier loans are not enough to compensate investors for the potential loss to which they are
exposed. This section uses an FCFF methodology to test this conclusion, taking into account
the time value of future cash flows.
3.2 Key Variables
3.2.1 Prosper
Variable name                  Type      Definition
Credit Rating                  Numeric   Proprietary credit rating by P2P lending platforms
Loan Status                    Dummy     Whether the loan is active, completed or default
Borrower Rate                  Numeric   Interest rate borrower is willing to pay
Borrower APR                   Numeric   Actual rate borrower needs to pay considering service cost
Lender Yield                   Numeric   Actual rate lenders receive considering service cost
Listing Category               Dummy     The purpose of the loan
Employment Duration            Numeric   The time period of employment till the creation of listing
Is Borrower home owner         Numeric   Whether the borrower owns real estate
Current Credit Line            Numeric   The number of credit lines the borrower owns
OpenRevolvingMonthlyPayment    Numeric   The monthly payment of revolving account
RevolvingCreditBalance         Numeric   The current credit balance of revolving account
BankcardUtilization            Numeric   The percentage utilization of revolving credit balance
AvailableBankcardCredit        Numeric   The total amount of bank card credit till the creation of the loan
TradesNeverDelinquent          Numeric   The percentage of delinquency of trades
DebtToIncomeRatio              Numeric   The percentage of debt to income
StatedMonthlyIncome            Numeric   Monthly income stated by borrowers
LoanOriginalAmount             Numeric   The original amount of loan originated
Investors                      Numeric   The number of investors who fund the loan
Terms                          Numeric   The term length of the loan
Both Prosper and Lending Club define "bad loans" as loans that are 60+ days past due within
the first twelve months from the date of loan origination.
3.2.2 Lending Club
Variables         Type      Definition
Grade             Dummy     The proprietary credit rating of Lending Club
loan_status       Dummy     The current status of the loan
int_rate          Numeric   The interest rate the borrower needs to pay
Purpose           Dummy     The purpose of the loan
emp_length        Numeric   The time length of the employment of the borrower
home_ownership    Dummy     If the borrower owns or rents an apartment
open_acc          Numeric   The number of open credit lines of the borrower
revol_bal         Numeric   The amount of current credit balance
revol_util        Numeric   The current ratio of credit balance utilization
dti               Numeric   The debt to income ratio
annual_inc        Numeric   The amount of annual income
loan_amnt         Numeric   The amount of the loan
installment       Numeric   The amount of monthly payment
Term              Numeric   The term length of the loan
3.3 Distribution of Dataset
3.3.1 Prosper
When describing the distribution of loan characteristics, we exclude current listings and
cancelled listings that were never completed and funded. Records with the proprietary credit
rating "NC" are also excluded due to incomplete information; those loans were originated in
early 2006 and 2007, when Prosper was in its infancy. In addition, 113 records are missing a
proprietary credit rating. We assume that these records won't affect the validity of our
analysis because there are so few of them.
Credit     Successful   Number of   Amount of Loans                              Default
Category   Rate         Loans       Total          Mean     Median    STDEV     Rate
AA         30%          6,487       61,402,940     9,466    12,000    6,664     11%
A          23%          10,479      101,490,254    9,685    11,000    6,664     16%
B          25%          12,023      117,411,802    9,764    12,000    8,345     22%
C          29%          14,892      125,436,437    8,423    10,000    7,044     28%
D          47%          15,259      96,539,254     6,326    7,500     5,853     31%
E          49%          10,286      43,717,649     4,250    4,000     2,629     37%
HR         76%          8,846       27,031,067     3,056    3,500     1,323     46%

Credit     Term                           Interest   $/investor   Credit
Category   1 year    3 years   5 years    rate                    Score
AA         3%        93%       4%         8.9%       53           791
A          3%        86%       12%        11.4%      73           738
B          3%        79%       19%        15.4%      87           712
C          2%        76%       22%        18.9%      104          682
D          2%        83%       15%        23.6%      91           667
E          3%        87%       10%        28.3%      103          640
HR         0%        100%      0%         29.3%      89           621
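As a quick consistency check on the Prosper distribution figures, the mean loan amount in each credit category should be close to the category's total amount divided by its number of loans. The sketch below runs that arithmetic on the reported totals and counts. The dictionary layout and the names used are illustrative choices, not part of the original analysis, and small gaps against the reported means are to be expected from rounding.

```python
# Reported Prosper figures per credit category:
# (total amount, number of loans, reported mean amount).
PROSPER = {
    "AA": (61_402_940, 6_487, 9_466),
    "A": (101_490_254, 10_479, 9_685),
    "B": (117_411_802, 12_023, 9_764),
    "C": (125_436_437, 14_892, 8_423),
    "D": (96_539_254, 15_259, 6_326),
    "E": (43_717_649, 10_286, 4_250),
    "HR": (27_031_067, 8_846, 3_056),
}


def implied_mean(total, count):
    """Mean loan amount implied by a category's total amount and loan count."""
    return total / count


# Gap between the implied mean and the reported mean, per category.
gaps = {}
for category, (total, count, reported) in PROSPER.items():
    mean = implied_mean(total, count)
    gaps[category] = mean - reported
    print(f"{category:>2}: implied mean {mean:,.0f}, reported {reported:,}")
```

Under these inputs, every implied mean lands within a couple of dollars of the reported value.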
The Prosper dataset distribution has several notable features. 1) Surprisingly, the
success rate of a listing being funded as a loan increases as credit worsens. This
might be caused by the higher interest rates paid by borrowers with worse credit ratings.
2) The majority of loans are rated C and D, consistent with our expectation that most loans
on Prosper (and on most P2P lending platforms) come from borrowers with poor credit records.
3) From the best credit rating to the worst, the mean and median loan amounts decline
continuously, mainly because of the limits placed by P2P platforms. 4) The default rate
climbs as credit gets worse. The default rate of AA loans is 11%, while it is 46% for HR loans.
5) As we expected, the interest rate increases as credit quality declines. An assessment in
the following section tests whether the interest rate advised by Prosper is enough to cover
the potential loss. 6) For loans with poor credit ratings, investors tend to
place more money on each investment.
[Figure: Number of Loans and Average Amount by Prosper credit category (AA to HR)]
[Figure: Borrower Rate vs. Prosper Rating]
Percentage of Total Loans by amount
Year   AA      A       B       C       D       E       HR      Default Rate
2006   7.5%    7.7%    9.3%    11.2%   9.8%    8.9%    45.6%   39.2%
2007   15.4%   16.8%   19.9%   21.3%   15.3%   6.2%    5.2%    39.5%
2008   23.3%   19.5%   23.2%   17.4%   11.2%   3.0%    2.5%    33.0%
2009   21.6%   24.9%   6.9%    17.9%   13.6%   5.2%    9.8%    15.2%
2010   16.1%   20.9%   14.7%   9.0%    19.5%   8.3%    11.5%   16.7%
2011   7.3%    17.5%   16.9%   9.1%    27.1%   16.7%   5.5%    22.6%
2012   7.3%    17.7%   18.1%   22.8%   18.9%   5.4%    9.9%    31.2%
2013   4.6%    16.5%   24.5%   31.2%   14.9%   6.8%    1.5%    23.6%
2014   6.8%    18.3%   24.0%   29.4%   1.4%    6.5%    13.6%   24.5%
7) Year by year, more investors switch from A or AA loans to riskier loans, especially to
loans in B and C. This trend might be driven by investors seeking higher interest rates as well
as by the improved default rate within each credit category. 8) Both the overall default rate
and the default rate for each credit category decreased continuously. However, investors are
becoming more risk-averse. This improvement can be explained by Prosper's better
risk screening and verification.
(When calculating the default rate, loans originated after Q2 2014 are excluded from the
dataset, because they could not yet be more than 60 days past due, the point at which loans
are considered to be in default.)
Default rate YoY
Year   AA      A       B       C       D       E       HR      Overall
2006   8.8%    16.7%   24.7%   36.2%   35.8%   48.8%   64.8%   39.2%
2007   14.3%   25.8%   33.3%   41.1%   42.8%   53.2%   62.2%   39.5%
2008   18.3%   25.6%   32.9%   33.4%   37.4%   43.6%   52.5%   33.0%
2009   6.0%    9.3%    16.8%   15.4%   22.4%   22.3%   23.7%   15.2%
2010   3.9%    9.8%    11.2%   15.3%   21.4%   24.9%   25.4%   16.7%
2011   2.9%    9.4%    15.5%   14.9%   24.8%   32.1%   31.0%   22.6%
2012   8.1%    9.3%    14.1%   20.1%   23.9%   25.9%   28.5%   31.2%
2013   4.1%    2.8%    4.6%    7.5%    10.8%   13.1%   13.6%   23.6%
2014   8.7%    0.4%    0.7%    1.2%    1.6%    2.5%    1.7%    24.5%
3.3.2 Lending Club
Credit     Successful   Number of   Amount of Loans                    Default
Category   Rate         Loans       Total          Average   STDEV     Rate
A          32.6%        20,076      213,245,525    10,622    6,586     8.5%
B          28.8%        33,882      402,115,200    11,868    6,861     17.2%
C          26.7%        27,641      352,094,900    12,738    7,769     24.2%
D          28.3%        17,980      246,222,500    13,694    8,426     30.8%
E          29.1%        8,484       148,964,150    17,558    9,505     36.4%
F          33.6%        3,772       73,021,450     19,359    9,225     43.5%
G          33.6%        916         20,171,950     22,022    8,417     43.2%
1) There is no significant difference in the success rate of a listing being funded across
credit categories on Lending Club. 2) Loans are concentrated in the better credit grades,
from A to D, in terms of both number of loans and total amount. 3) Unlike on Prosper,
lower-credit loans on LC tend to be larger than higher-credit loans. This is an indicator
that LC considers loan amount as a factor when rating loans. 4) There is no significant
shift in investors' risk aversion from year to year on Lending Club. 5) The default rate of
LC is much lower than that of Prosper in each year and in each category, but this doesn't
mean that the overall risk return generated by Lending Club is higher than that of Prosper
as a whole. More detail is given in the following sections. 6) Interest rates for loans of
the same credit rank on LC and Prosper are similar. 7) The default rate shows a trend of
improvement from 2007 to 2010. I don't take years after 2011 into consideration, since most
of those loans are still in the regular payment process, whereas most loans originated in
earlier years are either fully paid or in default.
Percentage of Loans by credit grade-LC
Year   A       B       C       D       E      F      G
2007   22.7%   24.3%   29.9%   14.7%   5.6%   2.8%   0.0%
2008   18.9%   32.5%   28.0%   14.2%   4.8%   1.3%   0.3%
2009   25.0%   28.9%   25.3%   13.9%   5.0%   1.4%   0.5%
2010   24.3%   30.7%   21.4%   14.0%   6.9%   2.1%   0.8%
2011   26.5%   30.2%   18.1%   12.9%   8.0%   3.3%   0.9%
2012   20.4%   34.7%   22.3%   13.7%   6.0%   2.5%   0.5%
2013   13.1%   32.7%   28.3%   15.3%   6.7%   3.3%   0.6%
2014   14.2%   26.6%   28.1%   18.9%   8.7%   2.8%   0.8%
Default Rate YoY-LC
Year   A      B       C       D       E       F       G       Overall
2007   1.8%   13.1%   18.7%   40.5%   35.7%   28.6%   0.0%    17.9%
2008   5.8%   14.6%   17.8%   24.3%   16.0%   47.6%   50.0%   15.8%
2009   6.7%   11.4%   14.8%   17.4%   21.6%   17.2%   34.8%   12.6%
2010   4.7%   11.1%   14.5%   18.6%   22.5%   30.0%   28.4%   12.6%
2011   6.6%   11.5%   16.8%   20.9%   23.8%   28.1%   31.5%   14.1%
2012   6.3%   11.0%   15.1%   19.1%   23.4%   25.6%   30.7%   13.2%
2013   1.7%   4.4%    7.4%    10.8%   12.8%   17.0%   16.6%   6.9%
2014   0.5%   1.1%    1.8%    2.8%    3.8%    5.8%    5.8%    1.9%
[Figure: Number of Loans by Risk Category (number of loans and average amount, grades A to G)]
[Figure: Interest Rate Range by Risk Category (interest rate vs. credit grade)]
3.4 Model Building and Interpretation-Lending Club
This section contains five steps. First, prune the datasets of Lending Club and Prosper for
model building. Second, select variables and build the logistic model to predict the default
probability. Third, interpret the significance of each variable and compare the estimates
with expectations. Fourth, choose alternative data models to predict loan status as well as
net profit/loss, and compare the results with the conclusions drawn from logistic
regression. Last, as a robustness check, I test the assumption of linearity between the
predicting variables and the target prediction, and explore the nonlinear relationship
between the target prediction and each individual predicting variable.
3.4.1 Data Preparation
In data preparation, I tried to incorporate only parameters that can be verified to some extent.
Some variables, such as loan purpose, can be fabricated by borrowers. Even though we could
build a model with good performance using those subjective parameters, the reliability of
such a model would be questionable.
1) Homeownership. The original options for this variable are "rent", "own", "Mortgage",
"None" and "Other". We create a dummy variable that takes the value 1 for "own" or
"Mortgage" and 0 for the rest.
2) There are over 300,000 rows of data; all current listings are excluded from the dataset
since we aim to detect indicators of risk from an investor's perspective.
3) Loan status is the target to predict. A loan status of "0" represents loans
that have already finished all payments or that are still in the payment process. "1"
represents defaulted loans, including charged-off loans, defaults, and delinquencies of more
than 31 days (since there are only two categories for delinquent loans: less than or equal
to 30 days, or more than 31 days). Initially, there are 87880 "completed" loans listed on
Lending Club, while my interest is in loans that have either finished all payments or
already declared default. Keeping that in mind, I further split completed loans into two
categories, paid and in-process. Within completed loans, only 5509 loans have already
finished all payments. The remaining 82371 completed loans are still in the payment process.
However, as shown in the graph below, 50% of bad loans declared default before Ih month, and
75% of bad loans declared default before the 17th month. This implies that those of the
82371 unfinished loans that have already passed this point have a great chance of
eventually paying off all installments. Therefore, in order to provide a reliable data model
and mitigate bias toward completed loans, I treat completed loans that have paid at least
17 installments as finished loans, and assume that they won't default in the future. By
doing this, I get 38555 good loans (finished all payments) and 24871 bad loans (default or
charged off).
[Figure: NO. of Month Paid vs. loan_status]
4) Income verified. "0" means that the income is not verified, while "1" means that the
income is verified.
5) Independent variables included in the regression: loan amount, term, employment length,
homeownership, annual income, whether the income is verified, debt-to-income ratio, FICO
credit score, open accounts, revolving credit balance, the utilization ratio of the revolving
credit balance, and total accounts. I excluded the variable "purpose" from the model due to
the low reliability of the value that borrowers enter when they apply for a loan.
6) Training and validation. The whole dataset is randomly partitioned into 43426 training
rows and 20000 validation rows.
7) Profit/Cost matrix. I need a cutoff value in order to classify the predictions into 0 or 1.
To set it, I first need to compute the profit/cost matrix for Lending Club. There are 63426
loans in the dataset, including 38555 good loans and 24871 bad loans. Good loans
generate $108,339,408 on a total original amount of $450,364,975, an ROI of 24.1%. Bad
loans cost investors a total loss of $219,172,141 on a total original amount of
$350,771,625, a negative ROI of 62.5%. Finished loans as a whole cause a loss of
$110,832,732 on a total amount of $801,136,600, a negative ROI of 13.8%. In other words,
the real ROI that Lending Club delivers to investors is much lower than the one it
advertises on its website. The profit/cost matrix is as below.
Profit Matrix
Actual          Predicted Loan Status
Loan Status     0         1
0               1         -1
1               -2.6      0
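The ROI figures in the profit/cost discussion can be reproduced directly from the reported dollar amounts. The short sketch below does that arithmetic and also computes the ratio of the bad-loan loss rate to the good-loan return, which comes out close to the 2.6 that appears in the profit matrix. Reading the 2.6 as this ratio is an interpretation, not something the text states, and the variable names are illustrative.

```python
# Dollar amounts reported for finished Lending Club loans.
good_profit = 108_339_408      # profit generated by good loans
good_principal = 450_364_975   # original amount of good loans
bad_loss = 219_172_141         # loss caused by bad loans
bad_principal = 350_771_625    # original amount of bad loans

# Return on investment for each group and for the portfolio as a whole.
roi_good = good_profit / good_principal
roi_bad = -bad_loss / bad_principal
net_result = good_profit - bad_loss
roi_total = net_result / (good_principal + bad_principal)

# Loss per dollar lent to a bad loan, relative to profit per dollar lent
# to a good loan (assumed here to be the source of the -2.6 entry).
loss_ratio = -roi_bad / roi_good

print(f"ROI of good loans: {roi_good:.1%}")
print(f"ROI of bad loans:  {roi_bad:.1%}")
print(f"Overall ROI:       {roi_total:.1%}")
print(f"Loss ratio:        {loss_ratio:.1f}")
```

The net result differs from the reported loss of $110,832,732 by one dollar, which is consistent with rounding in the underlying totals.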
3.4.2 Model Building
Before building the model, I selected variables based on the R-Square, AIC and BIC
rules, and then compared the performance of models using the different variable
combinations. 1) R-Square-oriented stepwise selection removes open_acc from the model. 2)
Minimum AIC recommends further removing home_ownership from the model. 3) Minimum BIC gives
the same result as minimum AIC, excluding open_acc and homeownership from the model.
Detailed results are listed below.
Variable selection results (Sig Prob is identical under all three rules)
Parameter        Sig Prob     Entered:            Entered:       Entered:
                              Maximize Rsquare    Minimum AIC    Minimum BIC
Intercept[1]     1            [X]                 [X]            [X]
loanamnt         8.30E-70     [X]                 [X]            [X]
term             3.00E-233    [X]                 [X]            [X]
emplength        5.00E-15     [X]                 [X]            [X]
homeownership    0.51441      [X]                 -              -
annualinc        1.30E-41     [X]                 [X]            [X]
isincv           6.81E-09     [X]                 [X]            [X]
dti              3.20E-84     [X]                 [X]            [X]
FICOScore        0            [X]                 [X]            [X]
openacc          0.88003      -                   -              -
revolbal         3.76E-09     [X]                 [X]            [X]
revolutil        4.57E-06     [X]                 [X]            [X]
Based on the result of variable selection, I ran the logistic regression. The parameter
estimates under the two slightly different variable combinations are listed below. There is
no significant difference in value or sign between the two sets of results. The
RSquare-oriented variable combination yields an RSquare of 0.2135, while the AIC/BIC-selected
combination gives only a slightly lower RSquare of 0.2134.
Estimate
Term             Maximize Rsquare    Minimum AIC/BIC
Intercept        -10.66162           -10.67306
loanamnt         -0.00003            -0.00003
Term             -0.03942            -0.03937
emplength        -0.02573            -0.02533
homeownership    0.01513             N/A
annualinc        0.00001             0.00001
isincv           -0.13985            -0.13967
Dti              -0.03298            -0.03296
FICOScore        0.01900             0.01902
revolbal         0.00001             0.00001
revolutil        0.21735             0.21590
Since the model using parameters selected by the RSquare stepwise rule gives a slightly better
result, I computed the formula accordingly:

$$P(\text{Default}) = \frac{1}{1 + e^{-\left(-10.66162 + \sum_i \beta_i X_i\right)}}$$

where $\beta_i$ is the coefficient of parameter $i$ and $X_i$ is the value of parameter $i$.
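To make the formula concrete, the sketch below evaluates the logistic function with the Maximize Rsquare estimates and applies the 0.44 cutoff reported for this model. The dictionary keys, the example borrower, and the units assumed for each input (term in months, revol_util as a fraction, and so on) are illustrative assumptions, since the scaling of the inputs is not spelled out here. The coefficients are also the rounded values from the estimate table, so the output is only indicative.

```python
import math

# Rounded estimates from the Maximize Rsquare column (intercept kept separately).
INTERCEPT = -10.66162
COEFFICIENTS = {
    "loan_amnt": -0.00003,
    "term": -0.03942,
    "emp_length": -0.02573,
    "home_ownership": 0.01513,
    "annual_inc": 0.00001,
    "is_inc_v": -0.13985,
    "dti": -0.03298,
    "fico_score": 0.01900,
    "revol_bal": 0.00001,
    "revol_util": 0.21735,
}
CUTOFF = 0.44  # loans at or above this probability are classified as default


def default_probability(loan):
    """P(Default) = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = INTERCEPT + sum(beta * loan[name] for name, beta in COEFFICIENTS.items())
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(loan, cutoff=CUTOFF):
    """Return 1 (predicted default) or 0 (predicted good loan)."""
    return 1 if default_probability(loan) >= cutoff else 0


# Hypothetical borrower, for illustration only.
example_loan = {
    "loan_amnt": 10_000, "term": 36, "emp_length": 5, "home_ownership": 1,
    "annual_inc": 60_000, "is_inc_v": 1, "dti": 15, "fico_score": 700,
    "revol_bal": 10_000, "revol_util": 0.5,
}
print(default_probability(example_loan), classify(example_loan))
```

Because each coefficient enters the exponent linearly, its sign gives the direction of the effect: raising an input with a negative coefficient lowers the predicted default probability.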
The confusion matrices generated from the two combinations are listed below. Both models
achieve their best performance at a cutoff value of 0.44, meaning that a loan is classified
as default if its default probability is equal to or greater than 0.44, and as good
otherwise. The overall accuracy is 69.1% for the RSquare combination and 68.8% for the
AIC/BIC combination. The former does a better job of identifying good loans, while the latter
is more accurate in identifying bad ones. Both combinations improve the overall ROI of
Lending Club, to negative 1.2% with the AIC/BIC combination and to negative 1.7% with the
RSquare combination. Even though the risk return after enhancement is still negative,
mitigating a loss of about 12% is a step forward. Not surprisingly, this improvement in the
overall risk-adjusted return to investors comes at a price. Applying this model means the
overall volume of loan origination will decline by 37.8%. In the long run, however, the
improvement in risk-adjusted return can help P2P platforms build credibility and attract
more investors, and thus more borrowers.
Confusion Matrix-RSquare
Predicted       Actual loan Status
loan Status     0        1
0               9180     3256
1               2923     4621

Confusion Matrix-AIC/BIC
Predicted       Actual loan Status
loan Status     0        1
0               8959     3099
1               3144     4778
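The accuracy figures quoted for the two logistic models can be recomputed from the four cells of each confusion matrix. The sketch below does so, reading the matrices with predicted status on the rows and actual status on the columns. That orientation is an assumption, chosen because it is the reading under which both models share the same actual totals and the share of rejected loans matches the reported 37.8% decline in origination volume. Function and key names are illustrative.

```python
def confusion_metrics(pred0_act0, pred0_act1, pred1_act0, pred1_act1):
    """Summary rates from a 2x2 confusion matrix (0 = good loan, 1 = default)."""
    total = pred0_act0 + pred0_act1 + pred1_act0 + pred1_act1
    actual_good = pred0_act0 + pred1_act0
    actual_bad = pred0_act1 + pred1_act1
    return {
        # Share of all loans classified correctly.
        "accuracy": (pred0_act0 + pred1_act1) / total,
        # Share of actual good loans predicted as good.
        "good_loan_accuracy": pred0_act0 / actual_good,
        # Share of actual bad loans predicted as default.
        "bad_loan_accuracy": pred1_act1 / actual_bad,
        # Share of loans the model would reject (predicted default).
        "share_rejected": (pred1_act0 + pred1_act1) / total,
    }


rsquare = confusion_metrics(9180, 3256, 2923, 4621)
aic_bic = confusion_metrics(8959, 3099, 3144, 4778)

for name, metrics in (("RSquare", rsquare), ("AIC/BIC", aic_bic)):
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```

Under this reading, the RSquare model is better at recognizing good loans and the AIC/BIC model is better at recognizing bad ones, which agrees with the comparison in the text.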
3.4.3 Model interpretation
In this section, I analyze the parameter estimates obtained in model building and compare
the results with commonly held business intuitions. To make the interpretation clearer: when a
parameter is said to have a positive impact on the default rate, it means that the higher the value
of the parameter, the higher the default probability of the loan, and vice versa.
Several papers have also tried to interpret the impact of these parameters. FICOScore has a negative
impact on the default rate, while debt-to-income ratio and credit line utilization have a positive
impact (Riza, Yanbin, Benjamas and Min, 2015). However, in the model that only includes finished
loans, some of the estimates are not intuitive. This section starts with the variables whose
estimates run counter to expectation, and then goes through those that match it.
1) "Loan amnt" has a negative impact on the default probability. Normally, a larger loan amount
gives the impression of higher risk, but this turns out not to be the case.
2) The same applies to "term". Lending Club allows two term lengths, 36 and 60 months. Holding
all other features constant, a 60-month loan doesn't carry a higher default risk than a 36-month
loan. A possible explanation is that Lending Club only approves a longer-term loan if the borrower
is more qualified.
3) "Home_ownership". Owning real estate doesn't necessarily mean that a borrower is more
creditworthy. The estimate actually points the opposite way.
4) "Annualinc". A higher income stated by the borrower when applying for a loan doesn't guarantee
a better outcome. The impact of this variable should be considered together with "is incv", which
has a negative impact on the default rate.
5) "dti" (debt-to-income ratio). This ratio also has a negative impact on the default rate. One
possible explanation is that some of the borrowers' income information is fictitious. Further
research in this paper will only include loans with verified income to see whether the result
differs.
6) The most surprising finding is that "FICOScore" has a positive impact on the default rate. One
might expect borrowers with a higher FICOScore to have better credit quality, since a credit score
backed by a third-party agency is normally very reliable. However, on Lending Club (and, as
discussed later, in Prosper's model), FICOScore is not a good indicator of credit quality. Lenders
can't simply base their decision on this score, which is what many investors actually do.
7) "revol_util" and "revolbal" have a positive impact on the default rate, which is consistent with
expectation. Because the majority of borrowers on Lending Club apply for loans to consolidate
personal credit lines, a higher balance and utilization ratio indicate higher financial pressure
in paying back the balance.
3.4.4 Robustness Check
Besides building the model to predict the nominal target parameter, I also used the same
predicting variables to predict a numeric parameter, net profit/loss, to check whether numeric
regression outperforms logistic regression. As in the previous section, I pruned the combination of
predicting variables according to the RSquare, AIC and BIC rules and list the result below. The
three selection rules give us the same result: keep all variables in the linear regression model.
Entered    Parameter     Estimate
[X]        Intercept     -13687.535
[X]        loan_amnt     -0.1754355
[X]        term          -106.27022
[X]        emplength     -44.95143
[X]        annualinc     0.00523282
[X]        is_inc_v      -239.96839
[X]        dti           -96.287356
[X]        FICOScore     27.2358601
[X]        revolbal      0.01584572
[X]        revolutil     1126.27626
The estimates in the linear regression make more intuitive sense than those from the logistic
regression. For instance, "loanamnt", "term" and "dti" have negative coefficients with net profit,
in the sense that the higher the value of the variable, the lower the profit or the higher the loss
that the loan causes investors. By contrast, FICO_Score and annual_inc have a positive effect on
the loan's net profit/loss. The model generates an RSquare of 0.1072, which is significantly lower
than that of the logistic model. To further test which model is superior, I also drew the confusion
matrix for the linear regression model by setting a profit/loss value as the cutoff between good
and bad loans. At a cutoff of a net profit/loss of negative $2,100, the model achieves its highest
accuracy of 67%, which breaks down into 74% accuracy in identifying good loans and 55% accuracy in
identifying bad loans. However, this model still performs worse than the logistic model.
Confusion Matrix-RSquare
Predicted       Actual loan Status
loan Status     0        1
0               9152     3422
1               3146     4258
The different signs of the same parameter's coefficient for default probability and for net profit
can be understood in two ways. First, the amount of net loss significantly outweighs that of net
profit, so the positive impact of FICOScore or annualinc can't bring enough profit to push the net
P/L into positive numbers. Second, it's nevertheless true that a higher FICOScore and annual_inc
can reduce the net loss if loans default, and can also increase the positive return if loans prove
to be good.
I also used discriminant analysis and a neural network to classify good and bad loans, and obtained
the confusion matrices listed below. Nominally, both models outperform the logistic model in overall
accuracy and in net profit when the cost matrix is applied to the results below. The overall
accuracy of the discriminant model is 68%, which breaks down into 70% for good loans and 65% for
bad loans. With the neural network, the accuracy is 69%, with 76% for good loans and 59% for bad
ones. However, discriminant analysis and neural networks have two key disadvantages. One is that
the structure of the model is not transparent and users can't interpret the importance of each
parameter, so investors can't easily apply the model when making investment decisions. The other is
that both models need to be updated dynamically whenever new data enters the original dataset.
Confusion Matrix-Discriminant
Predicted       Actual Loan Status
Loan Status     0        1
0               8608     2680
1               3690     5000

Confusion Matrix-Neural Network
Predicted       Actual loan_status
loan_status     0        1
0               10268    3932
1               2030     3748
Another key robustness check is to test the assumption of a linear relationship between each
predicting parameter and the target prediction (P/L). To do that, I test for the optimal structural
relationship, using RSquare as the rule to judge the best-fitting functional form. Detailed results
are listed below.
Predictor        Formula        Rsquare
LoanAmnt         Quintic        0.0567
Term             Linear         0.053
emplength        Quintic        0.0043
homeownership    Linear         0.0005
Annual_inc       Logistic 3P    0.0019
is_inc_v         Linear         0.0172
dti              Quintic        0.0244
FICOScore        Quartic        0.0169
open_acc         Quartic        0.0077
revolbal         Quintic        0.0082
revol_util       Logistic 3P    0.0055
After obtaining the formula for each parameter, I first return to the linear regression model, using
the newly transformed parameters together to predict net profit/loss, and compare it with the
previous linear regression model to check whether performance improves. Coefficients and selection
results are listed below.
Parameter        Estimate
Intercept        2103.486
Loanamnt         0.777
Term             0.612
Emplength        0.404
Annual income    -0.555
dti 2            0.778
FICOScore 2      1.021
open acc 2       0.000
Revolbal         -0.124
Revoluti         -0.687
homeownership    174.240
isincv           -96.392
Entered: [X] for 11 of the 12 parameters.
With the newly transformed parameters, the linear regression model achieves its best performance at
a cutoff of a loss of 2300, meaning that loans with a potential loss equal to or greater than 2300
are marked as default, and all others as good loans. The overall accuracy is 68.4%, with 80%
accuracy in identifying good loans and 50% in identifying bad loans. This model outperforms the
previous linear model by a slight margin, but it is still not as good as the discriminant or
neural network models.
Confusion Matrix-RSquare
Predicted       Actual loan Status
loan Status     0        1
0               9828     3843
1               2470     3837
I further used the newly transformed parameters to predict loan status with logistic regression,
and obtained the result below. This model outperforms the previous logistic model by 16%, with an
overall accuracy of 70%. I have not gone further to test the discriminant and neural network
models with the new parameters, but a reasonable guess is that the performance of these two models
would also improve. Testing alternative formulas or data structures reveals the need to investigate
the nonlinear relationship between the predictors and the target parameter.
Confusion Matrix
Predicted       Actual loan Status
loan Status     0        1
0               9609     3377
1               2689     4303
Risk Premium by Lending Club
Another important topic is to assess whether the interest rate charged to borrowers on Lending
Club is enough to compensate investors for their potential loss. To do this, I used IRR and FCFF
to compute the rate and PV for each loan (including current loans). When using IRR, we also
need to estimate the number of terms for which an investor receives installments on average, and
then combine it with the probability of default for each loan. For FCFF, it's important to find
the proper discount rate for each loan. Because loans differ from one another, the discount rate
that needs to be used also differs from loan to loan: the higher the risk, the higher the discount
rate should be. I used the interest rate computed from the regression model as the discount rate,
together with the expected number of terms for which the investor receives installments.
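As a rough illustration of the present-value side of this assessment, the sketch below computes the level monthly installment of an amortizing loan and discounts only the installments an investor actually receives before payments stop. It uses the standard annuity formula; the function names and the example figures in the usage note are assumptions made for illustration and are not taken from the Lending Club data or from the discount rates estimated in this section.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Level monthly payment of a fully amortizing loan (annuity formula)."""
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)


def present_value(installment, n_received, annual_discount_rate):
    """PV of the first n_received monthly installments at a monthly-compounded rate."""
    d = annual_discount_rate / 12.0
    return sum(installment / (1.0 + d) ** t for t in range(1, n_received + 1))


def net_present_value(principal, annual_rate, n_months, n_received,
                      annual_discount_rate):
    """PV of installments received minus the amount originally lent."""
    payment = monthly_installment(principal, annual_rate, n_months)
    return present_value(payment, n_received, annual_discount_rate) - principal


# Hypothetical 36-month loan of 10,000 at 12% that stops paying after month 17,
# discounted at the loan's own rate.
print(round(net_present_value(10_000, 0.12, 36, 17, 0.12), 2))
```

When all installments are received and the discount rate equals the loan rate, the net present value is zero by construction, so any shortfall in payments received, or any discount rate above the loan rate, pushes it negative. That is the sense in which a riskier loan needs a higher interest rate to break even.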
3.5 Model Building and Interpretation-Prosper
3.5.1 Data Preparation
There are 230448 rows in Prosper's dataset, including 151903 current loans; the rest are
either completed or defaulted loans. Even though the data layout of Prosper is slightly different
from that of Lending Club, the major variables are still available for logistic regression, and I
will compare features between Lending Club and Prosper at the end of this section. We will build
the prediction model for Prosper following the same rules as for Lending Club, and interpret
and visualize what we conclude from the model.
1) We again take out all current loans and only look at the completed and defaulted loans of
Prosper. In addition, I checked whether each loan has completed its payment terms
by cross-checking the columns "month since loan origination" and "term". Since the purpose is
to analyze only finished loans, I also excluded all loans with a status of "completed",
"past due <15 days" or "15 days<past due<30 days" that haven't paid the installment for the
15th month. The detailed rationale will be explained in the model building section.
2) When filtering data, we also exclude variables that are missing for many records.
For example, the variable "occupation" is missing in over 8000 rows. To preserve
integrity and validity, it's better to exclude this variable from our model.
3) 649 out of 78544 loans don't have a credit score. However, the credit record from the
third-party agency is one of the most critical variables in this paper. In order to maintain
the distribution of the credit score across all completed loans, I assign credit scores to
these loans by following the overall distribution.
4) FICOScore. As with Lending Club, the credit score recorded on Prosper is given as a
range. In order to build the model, we need to convert the range into a specific
number. We use the mean of the range as the independent variable in this section.
5) Credit line. There are three credit line related variables: current credit line, open credit
line and credit lines in the past 7 years. Due to the large number of missing values for the
current credit line and the open credit line, I only use "credit lines in the past 7 years" as
a numeric variable.
6) An obstacle in building the model for Prosper is the missing data in the bankcard
utilization rate. Since a major purpose for borrowers applying for loans on Prosper is
credit consolidation, knowing the credit balance utilization is critical to predicting the
default rate. However, 10% of records are missing the value for bankcard utilization.
There are also many missing values for the variable "debt-to-income ratio". For missing
values of "debt-to-income ratio", I approximated the ratio from the monthly stated income
and the monthly payment required by Prosper. For missing records of "bankcard
utilization", I replace blanks using "monthly revolving payment" divided by "monthly
stated income", which gives a reasonable approximation of the real bankcard utilization
ratio. There are 44 records with a stated monthly income of "0", and I assign a bankcard
utilization ratio of "1" to those records.
7) For loan status, I mark completed trades as 0 and default/charged-off trades as 1.
"Completed", "Past due (1-15 days)", "Past Due (16-30 days)", "Past Due (31-60 days)" and
"FinalPaymentInProgress" are marked as 0. All others are marked as 1.
8) Another variable that caught my attention is "stated monthly income". There are obvious
outliers in this variable: some borrowers state a monthly income higher than $15000
(some as high as $60k each month), while others state a monthly income of less than
1 dollar. Eventually, this variable was excluded from my model.
9) Debt-to-income ratio. This variable shows the ratio of the monthly installment to income.
The distribution of this variable also contains outliers, such as 650 records with a ratio
higher than 1. It makes no sense that a borrower could repay a monthly payment that is
higher than his or her income. Moreover, such a ratio should be a high-risk indicator, and
the platform shouldn't have passed loan applications from those borrowers in the first
place. Those outlier records were also excluded from my model.
10) LP net principal loss. This column records the net loss if the loan defaults. Among the
default/charged-off loans, there are 42 records with a negative net loss, indicating that
there was actually no net loss because of payments collected afterwards. Therefore, I
replace the negative amounts with "0". There are also loans "past due over 60 days" for
which no net principal loss is recorded because they haven't been officially marked as
default loans. In my model, I treat loans past due over 60 days as default, and calculate
their net principal loss as the difference between the loan origination amount and the
customers' payments.
11) After pruning the dataset in the ways listed above, I get 53637 records in total,
including only loans that either finished all payments or were marked as default or
charged off.
12) Interestingly, there are 200 loans that originated after the 3rd quarter of 2014 and are
already marked as either default or past due over 61 days.
13) To avoid over-fitting, I partitioned the whole dataset into 33637 rows of training data
and 20000 rows of validation data. The optimal model is the one that maximizes the
accuracy of prediction on the validation data.
14) Cost/Profit Matrix. To compute the cost/profit matrix of identifying bad loans and good
loans, I first need to calculate the total return and ROI of bad and good loans
respectively. Good loans amount to a total size of $183,514,490 and generate a total net
return of $40,276,694, a 21.9% return rate. Bad loans amount to $147,109,890 and
generate a total net loss of $106,944,809, a 72.7% loss. Considering all finished loans,
Prosper as a whole generates a negative 20.2% return for investors. Therefore, the cost
matrix should be as below.
Cost Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0     1             -1
Actual 1    -4              0
However, this result comes from looking purely at finished loans. I realized that it might
exaggerate the share of bad loans, because bad loans tend to default in the early months
of their payment term. The distribution of default months below supports this hypothesis.
[Figure: Default month vs. Term. Vertical axis: default month; horizontal axis: term (12, 36, 60 months).]
Loans with a 12-month term defaulted no later than the 11th month, most likely before
6 months. Bad loans with terms of 36 or 60 months tend to default no later than the 30th
month, and 75% of them defaulted before 15 months. Therefore, a fairer way to compute
the cost matrix is to also take into account those "unfinished" loans that have
successfully paid past the average default month. For instance, I treat 12-month loans
that are past their 11th month as good loans. The same logic is applied to 36- and
60-month loans. After incorporating unfinished loans that I consider highly likely to be
good loans, I obtained a new cost matrix. All good "finished" loans generate a return of
$88,705,490 out of $297,814,413, representing a 29.8% ROI. For bad loans, the total loss
sums up to $105,208,364 out of $1,471,098,890, representing a loss rate of 72.5%. Prosper
as a whole generates a net return of negative 3.7% for investors. Therefore, the profit
matrix of identifying good or bad loans is listed below. The purpose of the following
model building is to find a model that optimizes the overall profit.
Profit Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0     1             -1
Actual 1    -2.4            0
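The return figures in item 14 can be checked with a few lines of arithmetic. The Python
sketch below is a minimal illustration, not the procedure actually used for this thesis:
the function names are hypothetical, the dollar amounts and rates are copied from the
text, and normalizing the matrix so that a correctly funded good loan earns 1 is an
assumption about how the 1:2.4 ratio was obtained.

```python
# Dollar figures quoted in item 14 (finished loans only).
GOOD_PRINCIPAL, GOOD_RETURN = 183_514_490, 40_276_694
BAD_PRINCIPAL, BAD_LOSS = 147_109_890, 106_944_809


def rates(good_principal, good_return, bad_principal, bad_loss):
    """Return (good return rate, bad loss rate, overall return rate)."""
    good = good_return / good_principal
    bad = bad_loss / bad_principal
    overall = (good_return - bad_loss) / (good_principal + bad_principal)
    return good, bad, overall


def profit_matrix(good_rate, bad_rate):
    """Unit profit keyed by (actual, predicted); 0 = good loan, 1 = bad loan.

    Assumption: funding a good loan earns 1, rejecting a good loan costs 1,
    and funding a bad loan costs bad_rate / good_rate (rounded to one decimal).
    """
    ratio = round(bad_rate / good_rate, 1)
    return {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -ratio, (1, 1): 0.0}


good, bad, overall = rates(GOOD_PRINCIPAL, GOOD_RETURN, BAD_PRINCIPAL, BAD_LOSS)

# Second pass, using the rates stated in the text (29.8% ROI, 72.5% loss rate).
second_pass = profit_matrix(0.298, 0.725)

# Net P/L of the second pass: good-loan return minus bad-loan loss.
net_pl_second_pass = 88_705_490 - 105_208_364
```

The first call reproduces the 21.9%, 72.7% and negative 20.2% figures, and the net P/L of
the second pass equals the negative $16,502,874 quoted later in section 3.5.3.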
3.5.2 Model Building
With the cost matrix and a pruned dataset in hand, I ran different models to find an
optimal way to distinguish bad loans from good ones and thereby minimize losses for
investors. This section starts with logistic regression and explores other models later on.
After the selection in the previous section, there are 12 independent variables left in our
model, including "Term", "IsBorrowerHomeowner", "TotalCreditLinespast7years",
"OpenRevolvingMonthlyPayment", "CurrentlyinGroup", "BankcardUtilization",
"DebtToIncomeRatio", "LoanOriginalAmount", "FICOScore", "OpenRevolvingAccounts" and
"IncomeVerifiable".
Before building the model, I first select variables based on different criteria, including
highest R-square, minimum AIC and minimum BIC. I did not use a P-value threshold, since I
hold that subjectively choosing a P-value threshold to enter or remove variables from the
model is unreliable. 1) In the R-square-oriented selection, "TotalCreditLinespast7years" is
suggested for removal from the variable combination. 2) The minimum AIC methodology gives
the same result as step 1: remove "TotalCreditLinespast7years". 3) In the minimum BIC
methodology, besides the variable removed in the former two steps,
"OpenRevolvingMonthlyPayment" was also removed from the combination. All selection
details are listed below. Models will be built using the different variable combinations.
Variable selection results ([X] = entered, [ ] = not entered)
Parameter                     Sig Prob     Highest RSquare   Minimum AIC   Minimum BIC
Intercept[1]                  1            [X]               [X]           [X]
IsBorrowerHomeowner{1-0}      4.70E-17     [X]               [X]           [X]
FICO Score                    0            [X]               [X]           [X]
TotalCreditLinespast7years    0.16464      [ ]               [ ]           [ ]
OpenRevolvingAccounts         7.11E-06     [X]               [X]           [X]
OpenRevolvingMonthlyPayment   0.09087      [X]               [X]           [ ]
BankcardUtilization           7.30E-20     [X]               [X]           [X]
DebtToIncomeRatio             2.10E-56     [X]               [X]           [X]
IncomeVerifiable{1-0}         3.20E-93     [X]               [X]           [X]
LoanOriginalAmount            4.00E-107    [X]               [X]           [X]
Term                          1.00E-111    [X]               [X]           [X]
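To see why the minimum BIC criterion removes one more variable than the minimum AIC
criterion, it helps to compare the two penalties directly. The Python sketch below uses
the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L; the log-likelihood
values are hypothetical numbers chosen for illustration, not output of the models in this
thesis, and the statistical package may report a small-sample variant such as AICc.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_TRAIN = 33637  # size of the training partition stated in section 3.5.1

# Hypothetical fits: the extra variable improves the log-likelihood by 3.
with_var = {"log_likelihood": -20000.0, "n_params": 10}
without_var = {"log_likelihood": -20003.0, "n_params": 9}

aic_keeps_variable = aic(**with_var) < aic(**without_var)
bic_keeps_variable = bic(n_obs=N_TRAIN, **with_var) < bic(n_obs=N_TRAIN, **without_var)
```

With about 33637 observations, each extra parameter costs roughly 10.4 under BIC but only
2 under AIC, so a variable with a modest contribution can survive AIC selection and still
be dropped by BIC.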
I. I first use the variable combination selected by the R-square and minimum AIC
methodologies. The variable estimates and confusion matrix are listed below. This
model generates an RSquare of 0.1291 on the validation dataset. The overall error rate
is 33.7%, with a breakdown of a 27.5% error rate for good loans and a 46.5% error rate
for bad loans. Applying the profit matrix to maximize the overall profit, I found a
cut-off value of 35% for the predicted probability (if the predicted probability is
greater than 35%, the model identifies the loan as bad, otherwise as good).
Term                          Estimate     Std Error
Intercept                     -4.85106     0.13086
IsBorrowerHomeowner[0]         0.09597     0.01155
FICO Score                     0.00982     0.00018
OpenRevolvingAccounts          0.01642     0.00307
OpenRevolvingMonthlyPayment   -0.00006     0.00004
BankcardUtilization            0.29713     0.03254
DebtToIncomeRatio             -0.97204     0.06161
IncomeVerifiable[0]           -0.43483     0.02111
LoanOriginalAmount            -0.00005     0.00000
Term                          -0.02926     0.00131
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    9595           3634
Actual 1    2970           3417
II. Next, I further removed "OpenRevolvingMonthlyPayment", as indicated by the minimum
BIC selection. This combination generates an RSquare of 0.1363, slightly lower than
the one obtained from the former two selections. The variable estimates and confusion
matrix are listed below. The overall error rate of this model is 33.7%, with a
breakdown of a 27.5% error rate for good loans and a 46.5% error rate for bad loans.
The model achieves its best performance at a cutoff value of 35%.
Term                          Estimate     Std Error
Intercept                     -4.84518     0.13078
IsBorrowerHomeowner[0]         0.09907     0.01139
FICO Score                     0.00982     0.00018
OpenRevolvingAccounts          0.01377     0.00261
BankcardUtilization            0.28383     0.03150
DebtToIncomeRatio             -0.98470     0.06114
IncomeVerifiable[0]           -0.43614     0.02110
LoanOriginalAmount            -0.00005     0.00000
Term                          -0.02920     0.00130
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    9593           3636
Actual 1    2968           3419
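The cut-off value in both models is chosen by applying the profit matrix to the predicted
probabilities. A minimal Python sketch of such a search is given below. The function
names, the candidate grid and the five toy loans are hypothetical; the unit profits are
read from the profit matrix in section 3.5.1, with the assumption that funding a bad loan
is the cell that costs 2.4.

```python
# Unit profit keyed by (actual, predicted); 0 = good loan, 1 = bad loan.
PROFIT = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -2.4, (1, 1): 0.0}


def total_profit(probabilities, actuals, cutoff, profit=PROFIT):
    """Sum the unit profit when loans with probability > cutoff are flagged bad."""
    total = 0.0
    for probability, actual in zip(probabilities, actuals):
        predicted = 1 if probability > cutoff else 0
        total += profit[(actual, predicted)]
    return total


def best_cutoff(probabilities, actuals, candidates):
    """Return the first candidate cutoff that maximizes the total profit."""
    return max(candidates, key=lambda c: total_profit(probabilities, actuals, c))


# Toy validation set: predicted default probabilities and actual statuses.
toy_probabilities = [0.1, 0.2, 0.3, 0.6, 0.8]
toy_actuals = [0, 0, 1, 0, 1]
grid = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

chosen = best_cutoff(toy_probabilities, toy_actuals, grid)
chosen_profit = total_profit(toy_probabilities, toy_actuals, chosen)
```

On real data the same scan would be run over the validation partition, and the resulting
cut-off depends on both the profit matrix and the distribution of predicted probabilities.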
Removing the credit lines and revolving monthly payment variables doesn't improve the
accuracy of the model. However, it is more convenient to predict the loan status using as
few variables as possible. The formula for predicting the probability of a loan going into
default is given below.
$$P(\text{default}) = \frac{1}{1 + e^{-\left(-4.8 + \sum_i \beta_i X_i\right)}}$$
$\beta_i$: estimate of predicting variable $i$
$X_i$: value of predicting variable $i$
$i$: identifier of the predicting variables
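The formula can be evaluated directly for a single loan. The Python sketch below plugs in
the Model II estimates from section 3.5.2 and should be read as an illustration under
several assumptions: the two indicator variables are coded as 1 when the level is 0 and
as 0 otherwise, the probability is interpreted as in the formula above (the text does not
state which outcome level the software models), and the borrower values are hypothetical.

```python
import math

# Estimates copied from the Model II table in section 3.5.2.
INTERCEPT = -4.84518
COEFFICIENTS = {
    "IsBorrowerHomeowner[0]": 0.09907,  # assumed coding: 1 if not a homeowner
    "FICOScore": 0.00982,
    "OpenRevolvingAccounts": 0.01377,
    "BankcardUtilization": 0.28383,
    "DebtToIncomeRatio": -0.98470,
    "IncomeVerifiable[0]": -0.43614,  # assumed coding: 1 if income not verifiable
    "LoanOriginalAmount": -0.00005,
    "Term": -0.02920,
}


def predicted_probability(values):
    """Logistic function of the intercept plus the weighted sum of the inputs."""
    score = INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability, cutoff=0.35):
    """Apply the 35% cut-off: 1 = flagged as bad, 0 = treated as good."""
    return 1 if probability > cutoff else 0


# Hypothetical borrower, for illustration only.
borrower = {
    "IsBorrowerHomeowner[0]": 1,
    "FICOScore": 700,
    "OpenRevolvingAccounts": 5,
    "BankcardUtilization": 0.5,
    "DebtToIncomeRatio": 0.2,
    "IncomeVerifiable[0]": 0,
    "LoanOriginalAmount": 10000,
    "Term": 36,
}
probability = predicted_probability(borrower)
```

Because the FICOScore estimate is positive, raising the score in this sketch raises the
computed probability, which is the behavior discussed in section 3.5.3.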
3.5.3 Model interpretation
Based on the confusion matrix in the previous section, the model correctly predicts 72.5%
of good loans and 53.5% of bad loans. Applying this matrix to the real P/L statistics of
Prosper, the model improves the overall P/L of Prosper from negative $16,502,874 to a
positive return of $15,435,037, representing an ROI of 5.4%. This result implies that
investors can gain a positive return if they fully diversify their portfolio.
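The accuracy figures follow from the Model I confusion matrix in section 3.5.2. The
Python sketch below recomputes them; the helper name is hypothetical, and the assignment
of the four counts to cells is an assumption, chosen as the reading of the flattened
matrix under which the reported error rates are reproduced.

```python
def error_rates(matrix):
    """Return (good-loan error, bad-loan error, overall error) as fractions.

    matrix is keyed by (actual, predicted); 0 = good loan, 1 = bad loan.
    """
    good_total = matrix[(0, 0)] + matrix[(0, 1)]
    bad_total = matrix[(1, 0)] + matrix[(1, 1)]
    good_error = matrix[(0, 1)] / good_total
    bad_error = matrix[(1, 0)] / bad_total
    overall = (matrix[(0, 1)] + matrix[(1, 0)]) / (good_total + bad_total)
    return good_error, bad_error, overall


# Counts from the Model I confusion matrix (validation data).
MODEL_I = {(0, 0): 9595, (0, 1): 3634, (1, 0): 2970, (1, 1): 3417}

good_error, bad_error, overall_error = error_rates(MODEL_I)
good_accuracy = 1 - good_error
bad_accuracy = 1 - bad_error
```

Rounded to one decimal, this gives error rates of 27.5%, 46.5% and 33.7%, and hence the
72.5% and 53.5% accuracy quoted above.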
Several features can be derived from the model. I interpret the variables in sequence from
"unexpected" to "expected" results. 1) Unexpectedly, FICOScore makes a positive
contribution to the default probability of the loan, meaning that a higher credit rating
score can't guarantee better loan performance. This finding is surprising because the
result from a third-party credit rating agency should normally be a good indicator of loan
quality. To check that this finding is valid, I also looked at the distribution of
FICOScore for good and bad loans; the box plot is shown below. Good loans have a higher
FICOScore than bad loans at the 25th percentile, the mean, the 75th percentile and the
highest score. However, there are outliers among both good and bad loans that might
interfere with the accuracy of FICOScore. Unfortunately, even after removing all outliers
in terms of FICOScore, it still contributes positively to the default probability.
[Figure: FICO Score vs. LoanStatus. Box plot of FICO Score (vertical axis) by LoanStatus (0, 1).]
Also, as part of the robustness check, I ran the covariance report among variables. The
detailed report is discussed in section 3.5.4. 2) Owning real estate is not a sign of good
credit quality. 3) A higher "DebtToIncomeRatio" reduces the default risk of loans. 4)
"LoanOriginalAmount" has a negative impact on the default risk. 5) In line with
expectations, open revolving accounts and bankcard utilization have a positive impact on
the default rate. The more revolving accounts the borrower has, or the higher the bankcard
utilization ratio, the lower the credit quality and the higher the default probability.
This is because the main purpose for which borrowers apply for loans on Prosper is credit
consolidation, which means they borrow money from Prosper to pay off the balance due on
their revolving credit lines. 6) IncomeVerifiable. Although it is unclear how Prosper
verifies personal income, credibility increases if the borrower's income level can be
verified.
3.5.4 Robustness Check
The robustness check is conducted in this section. As a start, covariance is assessed among
variables to ensure a high degree of independence for each variable. A detailed report is
attached as appendix Exhibit I. As shown in the table, there is no obvious correlation among
the independent variables. Furthermore, I ran several other models to compare the
performance of different models and the reliability of assumptions.
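A check of this kind can be sketched in Python as a scan over pairwise Pearson
correlations. This is an illustration only: the helper names, the 0.8 threshold and the
toy columns are assumptions, and the actual report in Exhibit I was produced with the
statistical software used for this thesis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    flagged = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Toy columns with made-up values; "b" is an exact multiple of "a".
toy_columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
}
flagged_pairs = correlated_pairs(toy_columns)
```

An empty result for the real predictors would correspond to the statement above that no
obvious correlation exists among the independent variables.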
1) Linear Regression Model. I first tried a linear regression model using the profit/loss
amount as the target variable. The intention is that, given such a model, investors can
choose a reasonable cutoff value based on their level of risk aversion. Since the procedure
for building the model is similar to that for logistic regression, the details are not
provided here. The RSquare is as low as 0.04 on the validation dataset, so I won't
interpret the parameter estimates. The result is listed below.
Entered   Parameter                     Estimate
[X]       Intercept                     -6923.87
[ ]       IsBorrowerHomeowner           0
[X]       FICO Score                    8.802476
[ ]       TotalCreditLinespast7years    0
[ ]       OpenRevolvingAccounts         0
[ ]       OpenRevolvingMonthlyPayment   0
[X]       BankcardUtilization           724.0901
[X]       DebtToIncomeRatio             -1276.9
[X]       IncomeVerifiable              1588.358
[X]       LoanOriginalAmount            -0.17626
[X]       Investors                     2.251251
2) Next, I used the partition method to classify new loans into groups. The model
outperforms the logistic regression model in terms of RSquare, but it generates a lower net
profit for Prosper as a whole. The confusion matrix is listed below. The net profit turns
out to be $4,294,094, representing a 1.2% ROI for investors.
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    12025          1597
Actual 1    4388           1875
3) As a method for classifying listings, a neural network normally outperforms other models
in terms of RSquare and prediction accuracy. However, a major disadvantage of this model is
its lack of transparency. Its internal logic is hidden from users, so users can't really
uncover the importance of the determinants. The neural network offers the highest RSquare so
far, 0.1439. However, the overall net profit that the model generates is negative.
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    12239          954
Actual 1    4904           1359
4) I also applied the discriminant classification method to predict the loan status. The
confusion matrix is listed below. The overall error rate of this model is 36.8%, and it
generates a positive profit of $15,155,394, representing a 6% ROI. However, it is worth
noting that the model achieves this performance by giving up over 45% of the total revenue
of Prosper, making it unacceptable.
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    28721          15704
Actual 1    8250           12432
5) Last but not least, I also tested the linearity between the independent variables and the
target prediction. All the models used so far assume a linear relationship between the
predicting and target variables, while the real situation might differ from that
assumption. To test the assumption, I searched for the best non-linear relationship between
the target prediction and each individual predicting variable to determine the nonlinear
factor. For instance, for FICOScore, logistic 3P, 4P and 5P, polynomials from quadratic to
quintic, and other exponential relations were tested to find the best-fitting model.
Eventually, a quintic FICOScore gives the best RSquare.
I first used net P/L as the target prediction and tested the optimal functional form of the
relationship between the target and each independent variable. A table with the detailed
results is listed below. The R-Square using an individual variable is quite low, but the aim
here is only to get a sense of which non-linear form to use when combining all independent
variables.
Variable Name                 Dimension           R-Square
FICOScore                     Cubic               0.0046
Bankcard                      Exponential 4P      0.0017
Debttoincome                  Logistic 4P         0.007
No. of credit lines           Logistic 4P         0.00011
OpenRevolvingMonthlyPayment   Quintic             0.0013
LoanOriginalAmount            Biexponential 5P    0.0248
After revising the predicting variables based on the table above, I ran the regression to
test whether the model improves after applying the nonlinear relationships. However, the
performance of the numeric regression is still disappointing: the model generates an
R-Square as low as 0.04.
Using the variables built in the previous step, I further ran the logistic regression model
to predict the loan status. This time, the model generates an R-Square of 0.1125 and the
confusion matrix below at a cutoff value of 0.35 for the default probability. This model,
with a nonlinear assumption between predictors and target, offers a better performance.
Because of the limits of my skill set in machine learning, I won't proceed to explore more
complicated nonlinear models.
Confusion Matrix (rows: actual loan status; columns: predicted loan status)
            Predicted 0    Predicted 1
Actual 0    9715           3514
Actual 1    2968           3419
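The search for a non-linear form in item 5 can be illustrated, in a much reduced way, by
fitting polynomials of increasing degree to one predictor and comparing their R-Square.
The Python sketch below is a simplified stand-in under stated assumptions: it covers only
polynomial forms (not the logistic 3P/4P/5P or exponential fits), the function names and
toy data are hypothetical, and a real predictor such as FICOScore should be rescaled
first because the normal equations become ill-conditioned for large values and high
degrees.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns [c0, c1, ...] for c0 + c1*x + c2*x**2 + ..."""
    n = degree + 1
    # Augmented matrix of the normal equations.
    rows = [
        [sum(x ** (j + k) for x in xs) for k in range(n)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [rows[j][n] / rows[j][j] for j in range(n)]


def r_squared(xs, ys, coefficients):
    """Coefficient of determination of the fitted polynomial on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    residual = sum(
        (y - sum(c * x ** p for p, c in enumerate(coefficients))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1 - residual / total


def compare_degrees(xs, ys, degrees):
    """Map each polynomial degree to the R-Square of its least-squares fit."""
    return {d: r_squared(xs, ys, fit_polynomial(xs, ys, d)) for d in degrees}


# Toy data generated from an exact quadratic, y = x**2 - 3*x + 1.
toy_x = [0, 1, 2, 3, 4, 5]
toy_y = [x ** 2 - 3 * x + 1 for x in toy_x]
scores = compare_degrees(toy_x, toy_y, degrees=[1, 2, 3])
```

On this toy data the quadratic already fits exactly, so moving to a cubic adds nothing; on
the Prosper data the R-Square of every single-variable fit stays very low, as the table in
item 5 shows.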
As a hint for further research on P2P lending credit modeling, effects that widen the gulf
between good loans and bad loans should be amplified. In Prosper's dataset, I drew the
distribution of each predicting variable against loan status and found that the
distributions are almost identical for bad and good loans, except for FICOScore,
StatedMonthlyIncome and LoanOriginalAmount. Graphs portraying the distributions of
StatedMonthlyIncome and LoanOriginalAmount are shown below. A more accurate model needs to
amplify the difference in these three distinguishing variables as much as possible.
[Figure: StatedMonthlyIncome vs. LoanStatus. Vertical axis: StatedMonthlyIncome; horizontal axis: LoanStatus (0, 1).]
[Figure: LoanOriginalAmount vs. LoanStatus. Vertical axis: LoanOriginalAmount; horizontal axis: LoanStatus (0, 1).]
3.6 Comparison of Findings in Model Building for Lending Club and Prosper
This section compares the findings from Lending Club and Prosper in terms of similarities
and differences and tries to interpret the rationale behind them. As a conclusion, this
section also synthesizes lessons for China's P2P operators and regulators to further
scrutinize and promote the healthy development of the P2P lending industry.
3.6.1 Similarities
1) A major similarity that stands out is the negative return generated by Lending Club and
Prosper, which is quite different from what is claimed through official channels. If I
only include finished or defaulted loans, the ROI figure is even worse.
2) The performance of Lending Club and Prosper improves over time, indicating a learning
process in P2P lending itself.
3) There are only subtle differences between bad loans and good loans regarding predictor
parameters. To achieve high accuracy, models need to incorporate more parameters and more
complicated structures.
4) The profit matrices derived for Lending Club and Prosper are nearly identical, indicating
a phenomenon of the U.S. P2P lending industry that does not differ greatly across
platforms. In terms of identifying good loans and bad loans, the cost ratio is 1:2.6 for
Lending Club and 1:2.4 for Prosper.
5) Parameters shared by Lending Club and Prosper affect the default rate in the same
direction. For instance, "debt to income ratio" has a negative impact on default risk on
both Lending Club and Prosper.
6) A model with nonlinearly structured predicting parameters outperforms models with a
linear assumption.
7) The optimal error rates for both Lending Club and Prosper are close to 30%, and the
models do a better job of identifying good loans than bad loans.
8) To improve the overall ROI for investors, both Lending Club and Prosper need to give up
more than 30% of their loan volume.
3.6.2 Differences
1) Although both platforms generate negative returns for investors, Prosper outperforms
Lending Club thanks to more conservative strategies, including fewer term options and a
stricter credit screening process. (I tried to apply for a loan using my student status on
both Prosper and Lending Club. Prosper rejected me right away for credit quality reasons,
whereas Lending Club offered me the listing, but with a higher interest rate.)
2) I might have exaggerated Lending Club's loss because no data is disclosed regarding the
collection of defaulted loans. I expect that a certain portion of the principal amount
could be collected after Lending Club outsourced the collection process to collection
agencies. Besides, Lending Club's loan statuses have no category such as "past due over 31
days, less than 60 days" like Prosper has. Therefore, treating all loans past due over 31
days as default might also exaggerate the scale of default loans.
3) Lending Club only discloses the last payment date, without the detailed previous amounts
that indicate how much money borrowers have paid. This also affects the accuracy of the
net profit/loss of loans on Lending Club.
3.6.3 Lessons for China's P2P Lending
A big concern of mine regarding China's P2P lending is the over-optimism about default rates
on P2P lending platforms. On major official websites such as Lufax, CreditEase or Hongli
Capital, the declared default rate is less than 1%. In a more mature market like the U.S.,
the default rate for P2P lending is still higher than 20% if all completed loans are
included, and even as high as 40% if only "finished" loans are considered. It is hard to
imagine that a P2P lending market comprised of shadow banks, less trustworthy operators and
an immature credit system carries a default rate as low as 1%. A reasonable guess is that
there is a huge risk hidden under the water. The reason that P2P lending platforms in China
can keep this default risk under water is their unique business model. Operators in China
can collect money from a continuous stream of new wealth management products and use the
money to pay the promised return to previous investors. This fire-fighting model is not
sustainable. There was a rumor, right before this paper was written, that Lufax is facing a
huge default risk due to accumulating default loans on the platform.
Counter-intuitively, a higher credit rating score is not a good indicator of higher credit
quality on P2P lending platforms. The public, especially active participants who hope that
third-party credit agencies could be a life-saver for P2P lending in China, should be
skeptical of the real impact of credit ratings. Besides, China's P2P lending operators are
well known for their non-transparent structure. Authorities in China need to push more
stringent rules to regulate the disclosure of current portfolios to investors. In addition,
to mitigate the risk for investors, operators need to control the scale of the overall loan
amount for the whole platform and scrutinize more information about borrowers.
Another takeaway for China's P2P lending is that the potential risk of P2P lending is quite
high. Before making any investment decision in P2P lending, one should first assess his/her
risk aversion. Like the cut-off value I used to determine whether a loan is good or bad, the
judgement or risk tolerance of different investors can vary significantly. One should also
set a cut-off value for his/her loss level in the long run.
4. Conclusion.
4.1. Conclusion of this paper
The IPO of Lending Club and the multiples that investors placed on this stock signaled to
the market that P2P lending could be a "rocket science". This is another case where Web 2.0
revolutionizes the traditional banking industry. It has been only 10 years since the concept
of P2P lending was established in the UK, and the whole industry is still in the growth
stage of its lifecycle. So far, the P2P lending industry is still immature in terms of a
full ecosystem, a complete legislative environment and market rules. More and more
stakeholders are participating in P2P lending, such as payment collection agencies, data
filtering and analysis providers, and news channel operators specifically serving the P2P
lending industry. A more comprehensive ecosystem needs to be built that covers the complete
market dynamics and can impose strict requirements on major stakeholders. As an emerging
market for P2P lending, China first imitated the business model from western countries and
then developed it further according to its own business environment. Compared to the U.S.,
China has a longer way to go in building its credit system, regulations and a transparent
business model. From analyzing the data of Lending Club and Prosper, a big concern I have
about China's P2P lending market is its artificially low default rate. I believe that a huge
default risk is still hidden under the water for China's P2P lending industry.
Three conclusions can be drawn from this paper. First, the actual ROI that P2P lending
platforms generate for investors is much lower than what is claimed through official
channels, and is even negative. Considering only matured loans, the ROI that Prosper and
Lending Club offer to investors is negative 20.2% and negative 13.8% respectively. This
conclusion can be referred to when considering the real default rate of P2P loans in China's
market. Besides, third-party credit ratings in the U.S. are not as good an indicator of
potential risk as expected, and other parameters also have an unexpected impact on the
default rate. Second, a higher credit rating score, a lower debt-to-income ratio or a lower
bankcard utilization rate doesn't lead to a lower default rate. Therefore, investors should
not rely purely on those ratios in isolation when making investment decisions. Last but not
least, investors can use the formulas derived in chapter 3 to identify good and bad loans
and diversify the portfolio as much as possible to avoid adverse selection.
4.2. Further research proposed
Further research on the principal collection process after default is worth pursuing. We
should identify the responsibilities of, and consequences for, both borrowers and platforms
once borrowers default on P2P lending platforms. Research needs to be carried out
continuously to capture major changes in the P2P lending market. In addition, data mining
with machine learning methodologies to explore the nonlinear structure between predictor
parameters and the target prediction would be beneficial for both investors and platforms.
This paper doesn't cover the cooperation model between P2P lending and other financial
institutions, such as hedge funds and regional banks. Comparing the features of loans
invested in by institutional investors and personal investors will help unveil the logic of
how to better identify default risk and prevent adverse selection.
5. References
Riza Emekter, Yanbin Tu, Benjamas Jirasakuldech & Min Lu (2015) Evaluating credit risk and loan
performance in online Peer-to-Peer (P2P) lending, Applied Economics, 47:1, 54-70, DOI:
10.1080/00036846.2014.962222
Dongyu Chen, Chaodong Han, A Comparative Study of online P2P Lending in the USA and China.
Journal
of
Internet
Banking
and
Commerce,
August
2012,
vol.
17,
no.2
(http://www.arravdev.com/commerce/jibc/)
Simla Ceyhan, Xiaolin Shi & Jure Leskovec, Dynamics of Bidding in a P2P Lending Service: Effects
58
of Herding and Predicting Loan Success. WWW 2011, March 28-April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0632-4/11/03.
Peter Manbeck (2014), THE REGULATION OF PEER-TO-PEER LENDING: A Summary of the
Principal Issues
Trend Sorbe (2011), PERSON-TO-PERSON LENDING PROGRAM PRODUCT, SYSTEM, AND
ASSOCIATED
COMPUTER-IMPLEMENTED
METHODS.
Provisional
application
No.
61/033,069, filed on Mar. 3, 2008.
Efraim Berkovich (2010), Search and herding effects in peer-to-peer lending: evidence from
prosper.com. Ann Finance (2011) 7:389-405 DOI 10.1007/s10436-011-0178-6
Sergio Herrero-Lopez (2009), Social Interactions in P2P Lending.
Gregor N.F. Weif3a, Katharina Pelgerb & Andreas Horschc (2009), MITIGATING ADVERSE
SELECTION IN P2P LENDING EMPIRICAL EVIDENCE FROM PROSPER.COM,
Sven
C.
Berger
& Fabian
Gleisner
(2009),
Emergence
of Financial
Intermediaries
in
ElectronicMarkets: The Case of Online P2P Lending. BuR - Business Research Official Open
Access Journal of VHB Verband der Hochschullehrer fiir Betriebswirtschaft e.V. Volume 2 I Issue I
I May 2009 I 39-65.
Jefferson Duarte, Stephan Siegel & Lance Young (2012), Trust and Credit: The Role of Appearance in
Peer-to-peer Lending. Published by Oxford University Press on behalf of The Society for Financial
Studies.doi: 10.1 093/rfs/hhs071
Ruilei Li, Yang Guo & Wei Zhang (2013), The successful rate of loan origination and determinants in
P2P Lending. Financial Research, No. 7, 2013, General No. 397.
Ashta, A., & Assadi, D. (2009). An Analysis of European Online micro-lending Websites. EMN 6th
59
Annual Conference (Vol. 33, pp. 4-28). Milan: Fundaci6n Nantik Lum. Retrieved from
http://www.european- microfinance. org/ data/ file/ microlendingwebsites -Doc.
Barasinska, N. (2009). The Role of Gender in Lending Business : Evidence from an Online Market
for Peer-to-Peer Lending. The New York Times. Berlin.
B6hme, R., & P6tzsch, S. (2010). Privacy in online social lending. AAAI 2010 Spring Symposium on
Intelligent Privacy Management (pp. 23-28). Palo Alto: Stanford University. Retrieved from
http://www.aaai.org/ocs/index.php /SSS/SSS10 /paper/ viewPDFInterstitial/1 048/1472
Chemin, M., & De Laat, J. (2009). Can Warm Glow Alleviate Credit Market Failure? Evidence from
Online
Peer-to-Peer
Lenders.
papers.ssm.com.
Montreal.
Retrieved
from
http://papers.ssrn.com/soI3/papers.cfm?abstract id=1461438
Chen, K. Y., Golder, S., Hogg, T., & Zenteno, C. (2008). How Do People Respond to Reputation:
Ostracize, Price Discriminate or Punish? 2nd Intl. Workshop on Hot Topics in Web Systems and
Technologies
(p.
6).
Palo
Alto,
CA:
Hewlett-Packard
Labs.
Retrieved
from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.9701&rep=repl&type=pdf
Collier, B., & Hampshire, R. (2010). Sending Mixed Signals: Multilevel Reputation Effects in
Peer-to-Peer Lending Markets. ACM Conference on Computer Supported Cooperative Work (pp.
1-10). Savannah, Georgia: ACM.
Dhand, H., Mehn, G., Dickens, D., Patel, A., Lakra, D., & McGrath, A. (2008). Internet Based Social
Lending.
Communications
of
the
IBIMA,
2,
109-114.
Retrieved
from
http://www.doa-i.org/doai?func=abstract&id=5 64232
Everett, C. R. (2010). Group membership, relationship banking and loan default risk: the case of
online social
lending. Group.
West Lafayette, IN.
60
Retrieved from Available
at SSRN:
http://ssrn.com/abstract= 114428
Freedman, S., & Jin, G. Z. (2008). Dynamic Learning and Selection: the Early Years of Prosper.com.
Working paper. College Park, MD. Retrieved from
http://www.prosper.com/downloads/research/Dynamic-Learning-Selection-062008.pdf
Freeman, R. E. (2010). Strategic management: A stakeholder approach (p. 276). Boston: Cambridge
University Press.
Frerichs, A., & Schumann, M. (2008). Peer to Peer Banking - State of the Art. Göttingen.
Galloway, I. (2009). Peer-to-Peer Lending and Community Development Finance. Community
Development Investment Center Working Paper. San Francisco: Federal Reserve Bank of San
Francisco. Retrieved from http://ideas.repec.org/p/fip/fedfcw/2009-06.html
Garman, S. R., Hampshire, R. C., & Krishnan, R. (2008). Person-to-Person Lending: The Pursuit of
(More) Competitive Credit Markets. Twenty Ninth International Conference on Information
Systems (p. 17). Paris: Association for Information Systems.
Garman, S., Hampshire, R., & Krishnan, R. (2008). A Search Theoretic Model of Person-to-Person
Lending. May. Retrieved from http://www.heinz.cmu.edu/research/244full.pdf.
Greiner, M. E., & Wang, H. (2009). The Role of Social Capital in People-to-People Lending
Marketplaces. Thirtieth International Conference on Information Systems (p. 18). Phoenix:
Association for Information Systems.
Greiner, M., & Wang, H. (2007). Building Consumer-to-Consumer Trust in e-Finance Marketplaces.
13th Americas Conference on Information Systems (Vol. 211, p. 11). Keystone, Colorado:
Association for Information Systems. Retrieved from
http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1721&context=amcis2007
Hartley, S. E. (2010). Kiva.org: Crowd-Sourced Microfinance and Cooperation in Group Lending.
Group. Stanford, CA ; New York, NY.
Heng, S., Meyer, T., & Stobbe, A. (2007). Implications of Web 2.0 for financial institutions: Be a
driver, not a passenger (Vol. 2007, p. 11). Frankfurt. Retrieved from
http://mpra.ub.uni-muenchen.de/4316
Herrero-Lopez, S. (2009). Social Interactions in P2P Lending. Proceedings of the 3rd Workshop on
Social Network Mining and Analysis (pp. 1-8). Paris: ACM. Retrieved from
http://portal.acm.org/citation.cfm?id=1731011.1731014
Herrero-Lopez, S., Sheng-Ying Pao, A., & Bhattacharyya, R. (2008). The Effect of Social Interactions
on P2P Lending. Boston, MA. Retrieved from
http://courses.media.mit.edu/2008fall/mas622i/Projects/SergioAithneRahul/SocialInteractionsInP2PLending.pdf
Herzenstein, M., Andrews, R. L., Dholakia, U. M., & Lyandres, E. (2008). The Democratization Of
Personal Consumer Loans? Determinants Of Success In Online Peer-To-Peer Lending
Communities. Online. Newark, DE ; Houston, TX. JIBC August 2011, Vol. 16, No. 2. Retrieved
from http://www.prosper.com/downloads/research/democratizationconsumer-loans.pdf
Herzenstein, M., Dholakia, U. M., & Andrews, R. L. (2010). Strategic Herding Behavior in
Peer-to-Peer Loan Auctions. Newark DE ; Houston, TX.
Hildebrand, T., Puri, M., & Rocholl, J. (2010). Skin in the Game: Evidence from the Online Social
Lending Market. Group, (October). Retrieved from
http://www.rhsmith.umd.edu/feaconference/docs/Session3 Puri SkinintheGame.pdf
Iyer, R., Khwaja, A. I., Luttmer, E. F. P., & Shue, K. (2009). Screening in New Credit Markets: Can
Individual Lenders Infer Borrower Creditworthiness in Peer-to-Peer Lending? Management.
Cambridge, MA.
Jensen, M. C., & Meckling, W. H. (1976). Theory of the Firm: Managerial Behavior, Agency Costs
and Ownership Structure. Journal of Financial Economics, 3(4), 305-360. Retrieved from
http://tolstenko.net/blog/dados/Unicamp/2010.2/ce73.8/03 SSRN-id94043.pdf
Klafft, M. (2008). Peer to Peer Lending: Auctioning Microcredits over the Internet. Proceedings of the
2008 Int'l Conference on Information Systems, Technology and Management (pp. 1-8). Dubai: IMT.
Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1352383
Klein, T. (2008). Performance in Online Lending Platforms. Online. Friedrich-Schiller-Universität
Jena.
Kumar, S. (2007). Bank of One. Empirical Analysis of Peer-to-Peer Financial Marketplace. 13th
Americas Conference on Information Systems (p. 9). Keystone, Colorado: Association for
Information Systems. Retrieved from
http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1815&context=amcis2007
Larrimore, L., Jiang, L., Gorski, S., Markowitz, D., Zhao, J., & Canlas, K. (2009). Making an Offer
They Can't Refuse: How Borrower Language in Peer-to-Peer Lending Impacts Funding (TOP 3
Student Paper). Chicago, IL. Retrieved from
http://www.allacademic.com/meta/p_mla_apa_research_citation/2/9/9/4/4/p299440_index.html
Lin, M. (2009). Peer-to-Peer Lending: An Empirical Study. 15th Americas Conference on
Information Systems (p. 8). San Francisco: Association for Information Systems.
Lin, M., Prabhala, N. R., & Viswanathan, S. (2009a). Social Networks as Signaling Mechanisms:
Evidence from Online Peer-to-Peer Lending. pages.stern.nyu.edu. College Park. Retrieved from
http://pages.stern.nyu.edu/~bakos/wise/papers/wise2009-p09_paper.pdf
Lin, M., Prabhala, N. R., & Viswanathan, S. (2009b). Judging borrowers by the company they keep:
social networks and adverse selection in online peer-to-peer lending. papers.ssrn.com. College Park.
Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1355679
Livingston, L., & Glassman, T. (2009). Creating a new type of student managed fund using
peer-to-peer loans. Business Education & Accreditation, 1(1), 1-14. Retrieved from
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1555109
Martinho, L. (2009). Combining Loan Requests and Investment Offers in Peer-To-Peer Lending.
Workshop on Intelligent Agents and Technologies for e-Business (IAT4EB). Universidade do Porto.
Mcintosh, C. (2010). Monitoring Repayment in Online Peer-to-Peer Lending. San Diego.
Nahapiet, J., & Ghoshal, S. (1998). Social capital, intellectual capital, and the organizational
advantage. Academy of Management Review, 23(2), 242-266. Academy of Management. Retrieved
from http://www.jstor.org/stable/259373
Petersen, M. A. (2004). Information: Hard and soft. Northwestern University, Chicago IL. Evanston,
IL: Citeseer. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.8246&rep=rep1&type=pdf
Phelps, E. S. (1972). The Statistical Theory of Racism and Sexism. American Economic Review,
62(4), 659-661.
Pope, D. G., & Sydnor, J. R. (2008). What's in a Picture? Evidence of Discrimination from
Prosper.com. Journal of Human Resources. Philadelphia, PA. Retrieved from
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:What?s+in+a+Picture?+Evidence+of+Discrimination+from+Prosper#0
Ravina, E. (2007). Beauty, Personal Characteristics, and Trust in Credit Markets. papers.ssrn.com.
New York, NY. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=972801
Rumiany, D. (2007). Internet Bidding for Microcredit: making it work in the developed world,
conceiving it for the developing world. Development Gateway, March. Retrieved from
http://topics.developmentgateway.org/uploads/media/ict/Internet Bidding for Microcredit.pdf
Theseira, W. (2008). Competition to Default? Racial Discrimination in the Market for Online
Peer-to-Peer Lending. Philadelphia, PA.
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature
review. MIS Quarterly, 26(2), 13-23. Citeseer. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.6570
Paul Slattery (2013), Square Pegs in a Round Hole: SEC Regulation of Online Peer-to-Peer Lending
and the CFPB Alternative. Retrieved from
https://www.copyright.com/ccc/basicSearch.do?&operation=go&searchType=0&lastSearch=simple&all=on&titleOrStdNo=0741-9457
Xubo Wang, Defu Zhang, Xiangxiang Zeng & Xiaoying Wu, A Bayesian Investment Model for Online
P2P Lending. J. Su et al. (Eds.): ICoC 2013, CCIS 401, pp. 21-30, 2013.
Binjie Luo & Zhangxi Lin (2014), A decision tree model for herd behavior and empirical evidence
from the online P2P lending market. Inf Syst E-Bus Manage (2013) 11:141-160 DOI
10.1007/s10257-011-0182-4
Radha Vedala & Bandaru Rakesh Kumar (2014). An Application of Naive Bayes Classification for
Credit Scoring in E-Lending Platform
Ruiqiong Gao & Junwen Feng (2014), An Overview Study on P2P Lending. International Business
and Management Vol. 8, No. 2, 2014, pp. 14-18 DOI:10.3968/4801
Xue Rui, Bingwu Liu & Shaohua Tan (2012), Bayesian Network Based Causal Relationship
Identification and Funding Success Prediction in P2P Lending. Proceedings of 2012 4th
International Conference on Machine Learning and Computing, IPCSIT vol. 25 (2012), IACSIT Press,
Singapore
Zhensheng Zhang (2014), CREDIT RISK PREFERENCE IN E-FINANCE: AN EMPIRICAL
ANALYSIS OF P2P LENDING. PACIS 2014 Proceedings. Paper 197.
http://aisel.aisnet.org/pacis2014/197
Gwangjae Jeong, Eunkyoung Lee & Byungtae Lee (2012), Does Borrowers' Information Renewal
Change Lenders' Decision in P2P Lending? An Empirical Investigation. ICEC '12, August 07 - 08
2012, Singapore, Singapore. Copyright 2012 ACM 978-1-4503-1197-7/12/08
Jianxian Qiu, Zhangxi Lin & Binjie Luo (2012), Effects of Borrower-Defined Conditions in the
Online Peer-to-Peer Lending Market. M.J. Shaw, D. Zhang, and W.T. Yue (Eds.): WEB 2011,
LNBIP 108, pp. 167-179, 2012.
Mingfeng Lin, Nagpurnanand R. Prabhala, Siva Viswanathan, (2013) Judging Borrowers by the
Company They Keep: Friendship Networks and Information Asymmetry in Online Peer-to-Peer
Lending. Management Science 59(1):17-35. http://dx.doi.org/10.1287/mnsc.1120.1560
Chen, Dongyu; Hao, Lou; and Xu, Hong, "Gender Discrimination towards Borrowers in Online
P2P Lending" (2013). WHICEB 2013 Proceedings. Paper 55. http://aisel.aisnet.org/whiceb2013/55
Hongke Zhao, Le Wu, Qi Liu, Yong Ge & Enhong Chen (2014), Investment Recommendation in P2P
Lending: A Portfolio Perspective with Risk Management
Jiayu Wu (2014), Loan Default Prediction Using Lending Club Data
Krystyna Mitrega-Niestroj (2013), RECENT DEVELOPMENTS OF THE P2P LENDING MARKET
IN POLAND
Eric C. Chaffee & Geoffrey C. Rapp (2012), Regulating Online Peer-to-Peer Lending in the Aftermath
of Dodd-Frank: In Search of an Evolving Regulatory Regime for an Evolving Industry. 69 Wash.
& Lee L. Rev. 485 2012
Paul Slattery (2013), Square Pegs in a Round Hole: SEC Regulation of Online Peer-to-Peer Lending
and the CFPB Alternative. 30 Yale J. on Reg. 233 2013
Ying Wang & Zhangxi Lin (2014), The Importance of Objective and Dynamic Credit Evaluation in
P2P Lending Market.
Seth Freedman & Ginger Zhe Jin (2014), THE INFORMATION VALUE OF ONLINE SOCIAL
NETWORKS: LESSONS FROM PEER-TO-PEER LENDING, Working Paper 19820
http://www.nber.org/papers/w19820
Hossein Ghasemkhani, Yong Tan & Arvind K. Tripathi (2013), The Invisible Value of Information
Systems: Reputation Building in an Online P2P Lending System
Laura Gonzalez & Yuliya Komarova Loureiro (2014), When can a photo increase credit? The impact
of lender and borrower profiles on online peer-to-peer loans