A novel credit scoring model based on feature selection and PSO
Variable   Description                       Codings
dob        Year of birth                     If unknown, the year will be 99
nkid       Number of children                number
dep        Number of other dependents        number
phon       Is there a home phone             1 = yes, 0 = no
sinc       Spouse's income
aes        Applicant's employment status     V = government, W = housewife,
                                             M = military, P = private sector,
                                             B = public sector, R = retired,
                                             E = self-employed, T = student,
                                             U = unemployed, N = others,
                                             Z = no response
Data Mining Lectures
Lecture 18: Credit Scoring
Padhraic Smyth, UC Irvine
Outline
• What is classification?
• What is prediction?
• Classification by decision tree induction
• Prediction of Continuous Values
Classification vs. Prediction
• Classification
  – Predicts categorical class labels.
  – Classifies data (constructs a model) based on the training set and the
    values (class labels) in a classifying attribute; then uses the model to
    classify new data.
• Prediction
  – Models continuous-valued functions, i.e., predicts unknown or missing
    values.
Classification is prediction for discrete and nominal values.
[Figure: balls of different colors (red, green, gray, blue, …, pink) sorted
into buckets; one ball is marked "wt?"]
With classification, one can predict into which bucket to put the ball,
but cannot predict the weight of the ball.
Supervised and Unsupervised
• Supervised classification is classification.
  – The class labels and the number of classes are known.
[Figure: balls labeled 1–4 by color (red, green, gray, blue, …, pink); new
unlabeled balls ("?") are assigned to the known classes]
• Unsupervised classification is clustering.
  – The class labels are unknown.
• In unsupervised classification, the number of classes may also be unknown.
[Figure: unlabeled balls ("?") grouped into clusters discovered from the data]
Typical Applications
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
Classification Example: Credit Approval
• For example, credit scoring tries to assess the credit risk of a new
  customer. This can be transformed into a classification problem by:
  – creating two classes, good and bad customers;
  – generating a classification model from existing customer data and their
    credit behavior.
• This classification model can then be used to assign a new potential
  customer to one of the two classes, and hence accept or reject the
  application.
Classification Example: Credit Approval
Specific example:
• Banks generally have information on the payment behavior of their credit
  applicants.
• Combining this financial information with other information about the
  customers, such as sex, age, and income, it is possible to develop a
  system that classifies new customers as good or bad (i.e., the credit risk
  in accepting the customer is low or high, respectively).
Classification Process
[Diagram: the data is split into Training Data, from which a classifier
(model) is derived, and Test Data, on which the model's accuracy is
estimated]
Classification: Two-Step Process
1. Construct the model:
   • by describing a set of predetermined classes
2. Use the model in prediction:
   • estimate the accuracy of the model
   • use the model to classify unseen objects or future data
Preparing Data Before Classification
Data transformation:
• Discretization of continuous data
• Normalization to [-1, 1] or [0, 1]
Data cleaning:
• Smoothing to reduce noise
Relevance analysis:
• Feature selection to eliminate irrelevant attributes
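The two transformation steps above can be sketched in a few lines of Python. This is a minimal illustration; `normalize` and `discretize` are hypothetical helper names, and equal-width binning is just one possible discretization scheme.

```python
def normalize(values):
    """Min-max normalization of a numeric attribute to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, n_bins=3):
    """Equal-width discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(normalize([0, 5, 10]))                  # [0.0, 0.5, 1.0]
print(discretize([0, 1, 2, 3, 4, 5], n_bins=3))   # [0, 0, 1, 1, 2, 2]
```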
Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Step 1: Model Construction
1-a. Extract a set of training data from the database.
Each tuple/sample in the training set is assumed to belong to a predefined
class, as determined by the class label attribute (here, TENURED).

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Step 1: Model Construction
1-b. Develop / adopt classification algorithms.
1-c. Use the training set to construct the model.
The model is represented as:
• classification rules,
• decision trees, or
• mathematical formulae.

Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (a classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
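The constructed classifier is simply a function of a tuple's attributes. As a quick sketch, the rule above can be written directly in Python and checked against the training data:

```python
def tenured(rank, years):
    """The classification rule constructed in Step 1:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Applying the rule to the training tuples reproduces every training label:
training = [('Mike', 'Assistant Prof', 3, 'no'),
            ('Mary', 'Assistant Prof', 7, 'yes'),
            ('Bill', 'Professor', 2, 'yes'),
            ('Jim', 'Associate Prof', 7, 'yes'),
            ('Dave', 'Assistant Prof', 6, 'no'),
            ('Anne', 'Associate Prof', 3, 'no')]
print(all(tenured(rank, years) == label
          for _, rank, years, label in training))   # True
```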
Classification: Two-Step Process
2. Model evaluation (accuracy):
• Estimate the accuracy rate of the model based on a test set.
  – The known label of each test sample is compared with the classified
    result from the model.
  – The accuracy rate is the percentage of test-set samples that are
    correctly classified by the model.
  – The test set is independent of the training set; otherwise over-fitting
    will occur.
• The model is used to classify unseen objects:
  – give a class label to a new tuple
  – predict the value of an actual attribute
Step 2: Use the Model in Prediction
2-a. Extract a set of test data from the database.
2-b. Use the classifier model to classify the test data.
Note: the test set is independent of the training set; otherwise over-fitting
will occur.

Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Testing Data (known labels)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Classified Data (model output)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       yes
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Classification Process: Use the Model in Prediction
2-c. Compare the known label of the test data with the classified result
from the model.
2-d. Estimate the accuracy rate of the model. The accuracy rate is the
percentage of test-set samples that are correctly classified by the model.

Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Testing Data (known labels)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Classified Data (model output)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       yes
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Here three of the four test samples are classified correctly (Merlisa is
misclassified), so the estimated accuracy rate is 75%.
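Step 2-d reduces to a single division. A minimal sketch, using the known and classified labels of the four test tuples above:

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test-set samples correctly classified by the model."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

known      = ['no', 'no',  'yes', 'yes']   # Tom, Merlisa, George, Joseph
classified = ['no', 'yes', 'yes', 'yes']   # model output (Merlisa is wrong)
print(accuracy(known, classified))          # 75.0
```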
Classification Process: Use the Model in Prediction
2-e. Modify the model if need be.
2-f. The model is used to classify unseen objects:
• give a class label to a new tuple
• predict the value of an actual attribute

Modified classifier: IF rank = ‘professor’ THEN tenured = ‘yes’

Unseen Data (Tenured?)
NAME      RANK             YEARS
Maria     Assistant Prof   5
Juan      Associate Prof   3
Pedro     Professor        4
Joseph    Assistant Prof   8

Classified Data (model output)
NAME      RANK             YEARS   TENURED
Maria     Assistant Prof   5       no
Juan      Associate Prof   3       no
Pedro     Professor        4       yes
Joseph    Assistant Prof   8       no
Classification Methods
• Decision tree induction
• Neural networks
• Bayesian classification
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches
Improving Accuracy: Composite Classifier
[Diagram: the data is fed to Classifier 1, Classifier 2, Classifier 3, …,
Classifier n; their votes are combined to classify new data]
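Combining the votes of several classifiers is most simply done by majority vote. A minimal sketch for a single sample (ties are resolved in favor of the first label encountered):

```python
from collections import Counter

def combine_votes(predictions):
    """Combine the class labels proposed by n classifiers for one sample
    by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

print(combine_votes(['good', 'bad', 'good', 'good', 'bad']))   # good
```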
Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
Outline
• What is classification? What is prediction?
• Classification by decision tree induction
• Prediction of Continuous Values
Outline
• Introduction and Motivation
• Background and Related Work
• Preliminaries
– Publications
– Theoretical Framework
– Empirical Framework : Margin Based Instance Weighting
– Empirical Study
• Planned Tasks
Introduction and Motivation
Feature Selection Applications
[Figure: three application domains for feature selection:
• Text categorization: documents D1 … DM described by term counts T1 … TN,
  with a class label C (e.g., Sports vs. Travel vs. Jobs); the features are
  terms.
• Image analysis: samples described by pixels; the features are pixels.
• Bioinformatics: samples described by genes or proteins; the features are
  genes or proteins.]
Introduction and Motivation
Feature Selection from High-dimensional Data
[Diagram: High-Dimensional Data → Feature Selection Algorithm (MRMR, SVMRFE,
Relief-F, F-statistics, etc.) → Low-Dimensional Data → Learning Models
(classification, clustering, etc.)]
High-dimensional data: p >> n, where p = # of features and n = # of samples.
Curse of dimensionality:
• Effects on distance functions
• In optimization and learning
• In Bayesian statistics
Knowledge discovery on high-dimensional data — feature selection:
• Alleviates the effect of the curse of dimensionality
• Enhances generalization capability
• Speeds up the learning process
• Improves model interpretability
Introduction and Motivation
Stability of Feature Selection
[Diagram: the same Feature Selection Method applied to different Training
Data samples yields different Feature Subsets. Are they consistent?]
Stability of feature selection: the insensitivity of the result of a feature
selection algorithm to variations in the training set.
[Diagram: analogously, a Learning Algorithm applied to different Training
Data samples yields different Models]
The stability of learning algorithms was first examined by Turney in 1995.
The stability of feature selection was relatively neglected until recently,
when it attracted interest from data mining researchers.
Variable   Description                       Codings
dainc      Applicant's income
res        Residential status                O = owner, F = tenant furnished,
                                             U = tenant unfurnished,
                                             P = with parents, N = other,
                                             Z = no response
dhval      Value of home                     0 = no response or not owner,
                                             000001 = zero value,
                                             blank = no response
dmort      Mortgage balance outstanding      0 = no response or not owner,
                                             000001 = zero balance,
                                             blank = no response
doutm      Outgoings on mortgage or rent
doutl      Outgoings on loans
douthp     Outgoings on hire purchase
doutcc     Outgoings on credit cards
bad        Good/bad indicator                1 = bad, 0 = good
Feature Selection Process
Search strategies:
• Complete search
• Sequential search
• Random search
Evaluation criteria, with representative algorithms:
• Filter model: Relief, SFS, MDLM, etc.
• Wrapper model: FSBC, ELSA, LVW, etc.
• Embedded model: BBHFS, Dash-Liu’s, etc.
Evaluation Strategies
• Filter Methods
  – Evaluation is independent of the classification algorithm.
  – The objective function evaluates feature subsets by their information
    content: typically interclass distance, statistical dependence, or
    information-theoretic measures.
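A minimal filter criterion along these lines, and a sketch rather than any specific published measure: score one feature by the distance between its per-class means, with no classifier involved.

```python
def interclass_distance(values, labels):
    """Filter criterion sketch: absolute distance between the per-class
    means of one feature, computed without reference to any classifier."""
    mean = lambda xs: sum(xs) / len(xs)
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

# A feature that separates the classes scores higher than one that does not:
print(interclass_distance([1, 2, 9, 10], [0, 0, 1, 1]))   # 8.0
print(interclass_distance([5, 4, 5, 4],  [0, 0, 1, 1]))   # 0.0
```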
Evaluation Strategies
• Wrapper Methods
  – Evaluation uses criteria related to the classification algorithm.
  – The objective function is a pattern classifier, which evaluates feature
    subsets by their predictive accuracy (recognition rate on test data),
    estimated by statistical resampling or cross-validation.
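A wrapper criterion can be sketched with a 1-nearest-neighbour classifier evaluated by leave-one-out resampling (the classifier choice and the toy data here are illustrative assumptions):

```python
def loo_accuracy(rows, labels, subset):
    """Wrapper criterion sketch: leave-one-out accuracy of a 1-nearest-
    neighbour classifier restricted to the features in `subset`."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)
    correct = 0
    for i in range(len(rows)):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: dist(rows[i], rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

rows, labels = [[0, 5], [1, 4], [10, 5], [11, 4]], [0, 0, 1, 1]
print(loo_accuracy(rows, labels, [0]))   # 1.0  (feature 0 separates classes)
print(loo_accuracy(rows, labels, [1]))   # 0.0  (feature 1 does not)
```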
Naïve Search
• Sort the given n features in order of their individual probability of
  correct recognition.
• Select the top d features from this sorted list.
• Disadvantages
  – Feature correlation is not considered.
  – The best pair of features may not even contain the best individual
    feature.
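Naïve search is essentially a single sort. A sketch, with hypothetical per-feature recognition rates:

```python
def naive_select(recognition_rate, d):
    """Rank features by their individual recognition rate and keep the
    top d; correlations between features are ignored entirely."""
    ranked = sorted(recognition_rate, key=recognition_rate.get, reverse=True)
    return ranked[:d]

# Hypothetical individual recognition rates for four features:
rates = {'income': 0.81, 'age': 0.74, 'phone': 0.55, 'children': 0.60}
print(naive_select(rates, 2))   # ['income', 'age']
```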
Sequential Forward Selection (SFS)
(heuristic search)
• First, the best single feature is selected (i.e., using some criterion
  function).
• Then, pairs of features are formed using one of the remaining features and
  this best feature, and the best pair is selected.
• Next, triplets of features are formed using one of the remaining features
  and these two best features, and the best triplet is selected.
• This procedure continues until a predefined number of features are
  selected.
SFS performs best when the optimal subset is small.
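The greedy loop above can be sketched compactly. The criterion function here is a toy: hypothetical per-feature scores with a redundancy penalty when the correlated pair 'b' and 'c' appear together; in practice it would be a filter or wrapper criterion.

```python
def sfs(features, criterion, d):
    """Sequential forward selection: starting from the empty set, repeatedly
    add the feature that maximizes the criterion of the grown subset."""
    selected, remaining = [], list(features)
    while len(selected) < d:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion (hypothetical scores; 'b' and 'c' are redundant together):
score = {'a': 3, 'b': 2, 'c': 2, 'd': 1}
def criterion(subset):
    return sum(score[f] for f in subset) - (3 if {'b', 'c'} <= set(subset) else 0)

print(sfs(['a', 'b', 'c', 'd'], criterion, 2))   # ['a', 'b']
```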
Example
[Figure: results of sequential forward feature selection for classification
of a satellite image using 28 features. The x-axis shows the classification
accuracy (%) and the y-axis shows the features added at each iteration (the
first iteration is at the bottom). The highest accuracy value is shown with
a star.]
Sequential Backward Selection (SBS)
(heuristic search)
• First, the criterion function is computed for all n features.
• Then, each feature is deleted one at a time, the criterion function is
  computed for all subsets with n-1 features, and the worst feature is
  discarded.
• Next, each feature among the remaining n-1 is deleted one at a time, and
  the worst feature is discarded to form a subset with n-2 features.
• This procedure continues until a predefined number of features are left.
SBS performs best when the optimal subset is large.
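SBS is the mirror image of forward selection. A sketch with a toy criterion (hypothetical per-feature scores plus a redundancy penalty for the pair 'b' and 'c'); note that the "worst" feature is the one whose removal leaves the best-scoring subset:

```python
def sbs(features, criterion, d):
    """Sequential backward selection: starting from the full set, repeatedly
    discard the feature whose removal leaves the best-scoring subset."""
    selected = list(features)
    while len(selected) > d:
        worst = max(selected,
                    key=lambda f: criterion([g for g in selected if g != f]))
        selected.remove(worst)
    return selected

# Toy criterion (hypothetical scores; 'b' and 'c' are redundant together):
score = {'a': 3, 'b': 2, 'c': 2, 'd': 1}
def criterion(subset):
    return sum(score[f] for f in subset) - (3 if {'b', 'c'} <= set(subset) else 0)

print(sbs(['a', 'b', 'c', 'd'], criterion, 2))   # ['a', 'c']
```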
Example
[Figure: results of sequential backward feature selection for classification
of a satellite image using 28 features. The x-axis shows the classification
accuracy (%) and the y-axis shows the features removed at each iteration
(the first iteration is at the top). The highest accuracy value is shown
with a star.]
Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
– SFS is performed from the empty set
– SBS is performed from the full set
• To guarantee that SFS and SBS converge
to the same solution
– Features already selected by SFS are not
removed by SBS
– Features already removed by SBS are not
selected by SFS
“Plus-L, minus-R” selection (LRS)
• A generalization of SFS and SBS
  – If L > R, LRS starts from the empty set and repeatedly adds L features,
    then removes R features.
  – If L < R, LRS starts from the full set and repeatedly removes R
    features, then adds L features.
• LRS attempts to compensate for the weaknesses of SFS and SBS with some
  backtracking capability.
Sequential floating selection
(SFFS and SFBS)
• An extension to LRS with flexible backtracking capabilities
– Rather than fixing the values of L and R, floating methods determine these
values from the data.
– The dimensionality of the subset during the search can be thought of as
“floating” up and down.
• There are two floating methods:
– Sequential floating forward selection (SFFS)
– Sequential floating backward selection (SFBS)
P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature
selection, Pattern Recognition Lett. 15 (1994) 1119–1125.
Sequential floating selection
(SFFS and SFBS)
• SFFS
– Sequential floating forward selection (SFFS) starts from the
empty set.
– After each forward step, SFFS performs backward steps as long
as the objective function increases.
• SFBS
– Sequential floating backward selection (SFBS) starts from the
full set.
– After each backward step, SFBS performs forward steps as long
as the objective function increases.
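The SFFS loop can be sketched as a forward step followed by conditional backward steps. This is a simplified sketch (it stops the backward phase at subsets of size two and uses a toy criterion with hypothetical scores), not a full reimplementation of Pudil et al.'s algorithm:

```python
def sffs(features, criterion, d):
    """Sequential floating forward selection (sketch): after each forward
    step, perform backward steps as long as they improve the criterion."""
    selected, remaining = [], list(features)
    while len(selected) < d:
        # Forward step: add the best remaining feature.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        # Floating backward steps: drop features while the criterion improves.
        while len(selected) > 2:
            worst = max(selected,
                        key=lambda f: criterion([g for g in selected if g != f]))
            without = [g for g in selected if g != worst]
            if criterion(without) <= criterion(selected):
                break
            selected = without
            remaining.append(worst)
    return selected

# Toy criterion (hypothetical scores; 'b' and 'c' are redundant together):
score = {'a': 3, 'b': 2, 'c': 2, 'd': 1}
def criterion(subset):
    return sum(score[f] for f in subset) - (3 if {'b', 'c'} <= set(subset) else 0)

print(sffs(['a', 'b', 'c', 'd'], criterion, 3))   # ['a', 'b', 'd']
```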
Feature Selection using Genetic Algorithms (GAs)
(randomized search)
GAs provide a simple, general, and powerful framework for feature selection.
[Diagram: Pre-Processing → Feature Extraction → Feature Selection (GA) →
Classifier; the GA proposes feature subsets, which the classifier evaluates]
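A GA for feature selection encodes each candidate subset as a bit string and evolves a population of them. The sketch below uses illustrative assumptions throughout: population size, tournament size, mutation rate, and a hypothetical fitness function standing in for the classifier's accuracy.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=30, seed=1):
    """Randomized search sketch: feature subsets encoded as bit strings are
    evolved by tournament selection, one-point crossover and mutation.
    In a real wrapper, fitness would be the classifier's accuracy."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: each parent is the best of 3 random picks.
        parents = [max(rng.sample(pop, 3), key=fitness)
                   for _ in range(pop_size)]
        pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = rng.randrange(1, n_features)        # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                pop.append([bit ^ (rng.random() < 0.05)  # 5% bit-flip mutation
                            for bit in child])
    return max(pop, key=fitness)

# Hypothetical fitness: features 0 and 2 are informative, extras cost 1 each.
def fitness(mask):
    return 2 * (mask[0] + mask[2]) - sum(mask)

best = ga_select(6, fitness)
print(best[0], best[2])   # the informative features should be selected
```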