ICS 278: Data Mining
Lecture 18: Credit Scoring
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Data Mining Lectures
Lecture 18: Credit Scoring
Padhraic Smyth, UC Irvine
Presentations for Next Week
• Names for each day will be emailed out by tomorrow
• Instructions:
– Email me your presentations by 12 noon the day of your presentation
(no later please)
– I will load them on my laptop (so no need to bring a machine)
– Each presentation will be 6 minutes long + 2 minutes questions
• So probably about 4 to 8 (max) slides per presentation
References on Credit Scoring
D. J. Hand and W. E. Henley, Statistical Classification Methods in Consumer Credit Scoring: a Review, Journal of the Royal Statistical Society: Series A, Volume 160, Issue 3, November 1997
Available online at class Web page under lecture notes
Also:
L. C. Thomas, D. B. Edelman, and J. N. Crook, Credit Scoring and its Applications, SIAM, 2002
E. Mays (editor), Credit Risk Modeling, American Management Association, 1998
Outline
• Credit Scoring
– Problem definition, standard notation
• Data Sources
• Models
– Logistic regression, trees, linear regression, etc
• Model building issues
– Problem of reject inference
• Practical issues
– Cutoff selection, updating models
The Problem of Credit Scoring
• Applicants apply for a bank loan
– Population 1 is rejected
– Population 2 is accepted
• Population 2a repays their loan -> labeled “good”
• Population 2b goes into some form of default -> labeled “bad”
• Model building
– Build a model that can discriminate population 2a from population 2b
– Usually treated as a classification problem
– Typically want to estimate p(good | features) and rank individuals this
way
• Widely used by banks and credit card companies
– Similar problems occur in direct marketing and other “scoring”
applications
Many different applications for Customer Scoring
• Other financial applications:
– Delinquent loans: who is most likely to pay up
• Uses historical data on who paid in the past
• Often used to create “portfolios” of delinquent debt
– Customer revenue
• How much will each customer generate in revenue over the next K years
• Predicting marketing response
– Cost of a mailer to a customer is on the order of $1
– Targeted marketing
• Rank customers in terms of “likelihood to respond”
• “Churn” prediction
– Predicting which customers are most likely to switch to another brand
– E.g., wireless phone service
– Scores used to rank customers and then target most likely with incentives
• Many more….
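The targeted-marketing idea (rank by likelihood to respond, mail only where it pays) can be sketched as below. This is an illustrative sketch only: the $1 mailer cost comes from the slide, while `select_for_mailing`, the profit per response, and the response probabilities are hypothetical.

```python
def select_for_mailing(response_probs, profit_per_response, mailer_cost=1.0):
    """Return customer ids worth mailing, ranked by likelihood to respond.

    A customer is worth mailing when the expected profit
    p * profit_per_response exceeds the cost of the mailer.
    """
    ranked = sorted(response_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, p in ranked if p * profit_per_response > mailer_cost]


# Hypothetical response probabilities from some scoring model.
scores = {"a": 0.10, "b": 0.01, "c": 0.04, "d": 0.03}
targets = select_for_mailing(scores, profit_per_response=30.0)
print(targets)
```

With these made-up numbers only customers whose response probability exceeds 1/30 are mailed, so the break-even point depends directly on the assumed profit per response.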
Some background
• History
– General ideas started in the 1950’s
• e.g., Bill Fair and Eric Isaac -> FairIsaac -> FICO scores
– Initially a bit controversial
• Worries about it being unfair to some segments of society
– US Equal Credit Opportunity Acts, 1975/76
• Skepticism that “machine generated rules” from data could outperform
human generated guidelines
– First adopted in credit-card approvals (1960’s)
– Later broadly adopted in home-loans, etc
– Now widely accepted and used by almost all banks, credit-granting
agencies, etc
Data Sources
• Data from the loan application
– Age, address, income, profession, SS#, number of credit cards, savings, etc
– Easy to obtain
• Internal performance data
– How the individual has performed on other loans with the same bank
– May only be available for a subset of customers
• External performance data:
– Credit reports
• How the individual has performed historically on all loans and credit cards
• Relatively expensive to obtain (e.g., $1 per individual)
– Court judgements
– Real estate records
• Macro-level external data
– Demographic characteristics for applicant’s zip code or census tract
Loan Application Data
• Issues
– Data entry errors (e.g., birthday = date of loan application)
– Deliberate falsifications (e.g., over-reporting of income)
– Legal issues
• US Equal Credit Opportunity Acts, 1975/76
• Illegal to use race, color, religion, national origin, sex, marital status, or age
in the decision to grant credit
• But what if other variables are highly predictive of some of these variables?
Variable name   Description                      Codings
dob             Year of birth                    If unknown the year will be 99
nkid            Number of children               number
dep             Number of other dependents       number
phon            Is there a home phone            1 = yes, 0 = no
sinc            Spouse's income
aes             Applicant's employment status    V = Government
                                                 W = housewife
                                                 M = military
                                                 P = private sector
                                                 B = public sector
                                                 R = retired
                                                 E = self employed
                                                 T = student
                                                 U = unemployed
                                                 N = others
                                                 Z = no response
Variable name   Description                      Codings
dainc           Applicant's income
res             Residential status               O = Owner
                                                 F = Tenant furnished
                                                 U = Tenant unfurnished
                                                 P = With parents
                                                 N = Other
                                                 Z = No response
dhval           Value of home                    0 = no response or not owner
                                                 000001 = zero value
                                                 blank = no response
dmort           Mortgage balance outstanding     0 = no response or not owner
                                                 000001 = zero balance
                                                 blank = no response
doutm           Outgoings on mortgage or rent
doutl           Outgoings on loans
douthp          Outgoings on hire purchase
doutcc          Outgoings on credit cards
Bad             Good/bad indicator               1 = Bad
                                                 0 = Good
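Special codings like these have to be decoded before modeling, otherwise "no response" is read as a home value of zero. A minimal sketch, assuming the fields arrive as strings of digits; the function names are hypothetical, and treating any other value as the amount itself is an assumption.

```python
def decode_amount(raw):
    """Decode a dhval/dmort field into a number, or None when unknown.

    Codings from the table: blank = no response, 0 = no response or
    not owner, 000001 = zero value/balance. Any other value is taken
    to be the amount itself (an assumption).
    """
    text = raw.strip()
    if text == "":
        return None
    value = int(text)
    if value == 0:
        return None
    if value == 1:
        return 0
    return value


def decode_year_of_birth(raw):
    """Return the coded year of birth, or None for the 'unknown' code 99."""
    year = int(raw)
    return None if year == 99 else year
```

Returning None for the unknown codes keeps missing values distinct from a genuine zero, which a model can then handle explicitly.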
Credit Report Data
• Available from 3 major bureaus in the US:
– Experian, TransUnion, and Equifax
• Data in the form of a list of transactions/events
– Typically needs to be converted into feature-value form
• E.g., “number of credit cards opened in past 12 months”
– Can result in a huge number of features
• Cost varies as a function of type and time-window of data requested
– Interesting problem: “cost-optimal” downloading of selected credit
report features adapted to each individual as a function of cheaper
features
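Converting an event list into feature-value form might look like the sketch below. The tuple layout `(event_date, account_type, event_kind)` is a hypothetical stand-in for a bureau's transaction list, and a month is approximated as 30 days.

```python
from datetime import date, timedelta


def count_recent_openings(events, as_of, months=12):
    """Count credit card 'open' events in the window ending at as_of.

    `events` is a list of (event_date, account_type, event_kind) tuples
    (a hypothetical format). A month is approximated as 30 days.
    """
    start = as_of - timedelta(days=30 * months)
    return sum(
        1
        for when, account_type, kind in events
        if kind == "open"
        and account_type == "credit_card"
        and start < when <= as_of
    )


# Made-up events for one individual.
events = [
    (date(2003, 1, 15), "credit_card", "open"),
    (date(2003, 9, 1), "credit_card", "open"),
    (date(2003, 10, 20), "mortgage", "open"),
    (date(2001, 5, 5), "credit_card", "open"),
    (date(2003, 11, 2), "credit_card", "late_payment"),
]
n_12 = count_recent_openings(events, as_of=date(2003, 12, 1), months=12)
n_36 = count_recent_openings(events, as_of=date(2003, 12, 1), months=36)
```

Repeating this for each account type, event kind, and time window is what makes the number of candidate features grow so quickly.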
Defining Good and Bad
• Good versus Bad
– Not necessarily clear how to define 2 classes
– E.g.,
• Bad = ever 3 or more payments in arrears?
• Bad = 2 or more payments in arrears more than once?
– A “spectrum” of behavior
• Never any problems in payments
• Occasional problems
• Persistent problems
– Typical to discard the intermediate cases and also those with insufficient
experience to reliably classify them
• Not ideal theoretically, but convenient
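One possible labeling rule, written out as a sketch. The specific definitions here (bad = ever 3 or more payments in arrears, good = never in arrears, a 12-month minimum history, and checking history first) are assumptions chosen for illustration, not a standard.

```python
def label_account(worst_arrears, months_on_book, min_months=12):
    """Label an account 'good', 'bad', or None (discarded).

    Assumed definitions, for illustration only:
      bad  = ever 3 or more payments in arrears
      good = never any payments in arrears
    Intermediate cases and accounts with too little history
    are discarded (None).
    """
    if months_on_book < min_months:
        return None
    if worst_arrears >= 3:
        return "bad"
    if worst_arrears == 0:
        return "good"
    return None


# Hypothetical accounts: (worst number of payments in arrears, months on book).
accounts = [(0, 24), (4, 30), (1, 18), (0, 3)]
labels = [label_account(w, m) for w, m in accounts]
```

Accounts labeled None are dropped from the training sample, which is the convenient but theoretically imperfect step noted above.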
Selecting a Data Set for Model Building
• Sample selection
– Typical sample sizes ~ 10k to 100k per class
– Should be representative of customers who will apply in the future
– Need to be able to get the relevant variables for this set of customers
• Internal performance data
• External performance data
• Etc
• External data sources (e.g., credit reports) can result in a very large
number of possible variables
– E.g., in the 1000’s
– E.g., “number of accounts opened in past 12/24/36/… months”
– Typically some form of variable selection done before building a model
• Often based on univariate criteria such as information gain
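Univariate screening by information gain can be sketched as follows for discrete features. The function names and the tiny data set are hypothetical; continuous variables would need to be binned first.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy from splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Made-up sample of four applicants.
labels = ["good", "good", "bad", "bad"]
columns = {
    "phon": [1, 0, 1, 0],          # unrelated to the label here
    "res": ["O", "O", "F", "F"],   # separates good from bad perfectly here
}
ranking = rank_features(columns, labels)
```

Because each variable is scored on its own, this kind of screening is cheap for thousands of candidates but cannot see interactions between variables.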
Models used in Credit Scoring
• Regression:
– Ignore the fact that we are estimating a probability
– Typically linear regression is used
• Classification (more common approach)
– Logistic regression (most widely used)
– Decision trees (becoming more popular)
– Neural networks (experimented with, but not used in practice so much)
– Nearest neighbors
– Model combining - some work in this area
– SVMs - too new, relatively unproven
• General comments
– Many trade-secrets, companies like FairIsaac do not publish details
– Generally the industry is conservative: prefer well-established methods
– Classification accuracy is only one part of the overall solution….
Logistic Regression Models
log(odds) = logit(p) = g^{-1}(p):
\[ \log\left(\frac{p}{1-p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p \]
[Figure: p (from 0.0 to 1.0) plotted against w_0 + w_1 x_1, together with the training data]
Note that near 0, logit(p) is almost linear, so linear and logistic regression will be similar in this region
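A minimal sketch of the logistic model with a single feature, fitted by plain gradient ascent on the log-likelihood. The fitting routine, learning rate, and data are illustrative assumptions; in practice a statistics package would be used.

```python
import math


def sigmoid(z):
    """Inverse of the logit: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Fit w0, w1 for one feature by gradient ascent on the log-likelihood."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w0 + w1 * x)  # gradient contribution of one case
            g0 += err
            g1 += err * x
        w0 += lr * g0 / n
        w1 += lr * g1 / n
    return w0, w1


# Made-up, non-separable data: y = 1 means "good".
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
w0, w1 = fit_logistic(xs, ys)
p_good = sigmoid(w0 + w1 * 2)  # estimated p(good | x = 2)
```

When the linear score w0 + w1x1 is close to 0, sigmoid(z) is close to 0.5 + z/4, which is one way to see why linear and logistic regression give similar answers in that region.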
Modeling Example
(from Hand and Henley paper)
Model                                      Bad Risk Rate (%)
k nearest neighbor with special metric     43.09
k nearest neighbor (standard)              43.25
logistic regression                        43.30
linear regression                          43.36
decision tree                              43.77
Evaluation Methods
• Decile/Centile reporting:
– Rank customers by predicted scores
– Report “lift” rate in each decile (and cumulatively) compared to accepting
everyone
• Receiver Operating Characteristic (ROC)
– Vary classification threshold
– Plot proportion of good risks accepted vs. bad risks accepted
• Bad risk rate = bad risk among those accepted
– Let p = proportion of good risks
– Let a = proportion accepted
– Can show that, with a > p, the bad risk rate among those accepted is lower
bounded by 1 – p/a
– E.g., p = 0.45, a = 0.70 => bad risk rate must be between about 0.36 and 0.79
(at least 1 – p/a, and at most (1 – p)/a when every bad risk is accepted)
Economics of Credit Scoring
• Classification accuracy is not the appropriate metric
• Benefit = increase in revenue from using model - cost of developing and
installing model
– Model maintenance and updating should also probably be included
• Model development: anywhere from $5k to $100k depending on the
complexity of the modeling project
• Model installation: can be expensive (software, testing, legal requirements)
• Revenue increase is based on estimated performance plus assumptions about
the cost of bad risks versus good risks
• Small improvements in accuracy (e.g., 1 to 5%) could lead to significant
gains if the model is used on large numbers of customers
Problem of Reject Inference
• Typically the population available for training consists only of past
applicants who were accepted
– Application data is available for “rejects”, but no performance data
• Question:
– Is there a way to use the data from rejected applicants?
• Answer: no widely accepted approach. Methods include
– Define all rejects as “bad” (not reliable!)
– Build a statistical model (treat labels as missing, but biased)
• Can be quite complex, see Section 5 in Hand and Henley paper
– Grant credit to some fraction of rejects and track their performance so
that the “full population” is sampled
• Rarely used for loans, but ideally is the best method
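Why the accepted-only sample is biased can be seen in a small simulation. Everything about the setup is a hypothetical assumption: applicant quality x is uniform on [0, 1], an applicant goes bad with probability 1 - x, and the past policy accepted only x > 0.5.

```python
import random


def simulate_bad_rates(n=20000, cutoff=0.5, seed=0):
    """Compare the bad rate in the full population with the accepted group.

    Hypothetical setup: quality x is uniform on [0, 1], an applicant
    goes bad with probability 1 - x, and the past policy accepted
    only applicants with x > cutoff.
    """
    rng = random.Random(seed)
    total_bad = accepted = accepted_bad = 0
    for _ in range(n):
        x = rng.random()
        bad = rng.random() < 1.0 - x
        total_bad += bad
        if x > cutoff:
            accepted += 1
            accepted_bad += bad
    return total_bad / n, accepted_bad / accepted


population_rate, accepted_rate = simulate_bad_rates()
```

In this setup the bad rate among accepted applicants comes out well below the rate in the full applicant population, so a model trained only on accepted applicants sees a distorted picture of the people it will later score.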
Other issues
• Threshold selection
– Above what threshold should loans be granted
– Depends on goals of the project
• E.g., focusing on a small set of high-scoring customers versus “widening the
net” to include a larger number (but still minimizing risk)
• Time-dependent classification
– What really matters is what the customer will do at time t+T
– Can we model the “state” of a customer (rather than statically)?
• Still somewhat of a research topic
• Overrides
– Loans are still manually “signed-off”. The bank may sometimes override
the system’s recommendation
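Threshold selection can be sketched as a search over cutoffs on a holdout sample. The gain per good loan, the loss per bad loan, and the scored sample are all hypothetical; a real project would also reflect the goals discussed above.

```python
def holdout_profit(scored, cutoff, gain_good, loss_bad):
    """Profit on a holdout sample when granting loans with p_good >= cutoff.

    `scored` is a list of (p_good, is_good) pairs.
    """
    profit = 0.0
    for p_good, is_good in scored:
        if p_good >= cutoff:
            profit += gain_good if is_good else -loss_bad
    return profit


def best_cutoff(scored, gain_good, loss_bad):
    """Try each observed score as a cutoff and return the most profitable one."""
    candidates = sorted({p for p, _ in scored})
    return max(
        candidates,
        key=lambda c: holdout_profit(scored, c, gain_good, loss_bad),
    )


# Made-up holdout sample of (predicted p(good), actual outcome).
scored = [
    (0.95, True), (0.90, True), (0.80, True),
    (0.70, False), (0.60, True), (0.40, False),
]
strict = best_cutoff(scored, gain_good=100.0, loss_bad=500.0)
lenient = best_cutoff(scored, gain_good=100.0, loss_bad=50.0)
```

When a bad loan costs much more than a good loan earns, the best cutoff is high (a small set of high-scoring customers); when bad loans are cheap relative to the gain, the cutoff drops and the net widens.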
The model works… now what?
• Implementation
– Depends on whether the model is replacing an existing automated model
– … or is the first time modeling is being applied to the problem
– Many software issues in terms of databases, security, etc
• Monitoring and tracking
– Important to see how the scorecard works in practice
– Generating monthly/quarterly reports on scorecard performance
• (naturally there will be some delay in this)
– Analyzing performance in detail on segments, by attribute, etc
• Time for a new model?
– E.g., population has changed significantly
– E.g., new (cheap and useful) data available
– E.g., new modeling technology available