A novel credit scoring model based on feature selection and PSO

Credit-scoring data set: variable descriptions (part 1)

  Variable name  Description                     Codings
  dob            Year of birth                   coded 99 if unknown
  nkid           Number of children              number
  dep            Number of other dependents      number
  phon           Is there a home phone?          1 = yes, 0 = no
  sinc           Spouse's income
  aes            Applicant's employment status   V = government, W = housewife,
                                                 M = military, P = private sector,
                                                 B = public sector, R = retired,
                                                 E = self-employed, T = student,
                                                 U = unemployed, N = others,
                                                 Z = no response

(Source: Data Mining Lectures, Lecture 18: Credit Scoring, Padhraic Smyth, UC Irvine)

Outline
• What is classification? What is prediction?
• Classification by decision tree induction
• Prediction of continuous values

Classification vs. Prediction
• Classification
  – Predicts categorical class labels.
  – Constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data.
• Prediction
  – Models continuous-valued functions, i.e., predicts unknown or missing values.
• Classification is prediction for discrete and nominal values: given a ball, classification can predict which bucket (red, green, gray, blue, ..., pink) it belongs in, but it cannot predict the weight of the ball.

Supervised and Unsupervised
• Supervised classification is classification: the class labels and the number of classes are known.
• Unsupervised classification is clustering: the class labels are unknown, and the number of classes may also be unknown.

Typical Applications
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis

Classification Example: Credit Approval
• Credit scoring tries to assess the credit risk of a new customer. This can be transformed into a classification problem by:
  – creating two classes, good and bad customers;
  – generating a classification model from existing customer data and their credit behavior;
  – using the model to assign a new potential customer to one of the two classes, and hence accepting or rejecting the application.

Classification Example: Credit Approval (specific example)
• Banks generally have information on the payment behavior of their credit applicants.
• Combining this financial information with other customer attributes such as sex, age, and income, one can develop a system that classifies new customers as good or bad (i.e., the credit risk of accepting the customer is low or high, respectively).

Classification Process
• Training data → derive the classifier (model)
• Test data → estimate the model's accuracy

Classification: Two-Step Process
1. Construct the model
   • Describe a set of predetermined classes.
2. Use the model in prediction
   • Estimate the accuracy of the model.
   • Use the model to classify unseen objects or future data.
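The two-step process can be made concrete with a minimal sketch, assuming scikit-learn is available; the synthetic data set, the decision-tree model, and the 70/30 split are illustrative choices, not part of the lecture material.

```python
# A minimal sketch of the two-step classification process (illustrative data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: each sample belongs to one of two predetermined classes.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Keep the test set independent of the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: construct the model from the training data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Step 2: use the model in prediction and estimate its accuracy on the test set.
y_pred = model.predict(X_test)
print("Accuracy rate:", accuracy_score(y_test, y_pred))
```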
Preparing Data Before Classification
• Data transformation: discretization of continuous data; normalization to [-1, 1] or [0, 1].
• Data cleaning: smoothing to reduce noise.
• Relevance analysis: feature selection to eliminate irrelevant attributes.

Step 1: Model Construction
1-a. Extract a set of training data from the database. Each tuple/sample in the training set is assumed to belong to a predefined class, as determined by the class label attribute (here, TENURED).

  Training data
  NAME   RANK            YEARS  TENURED
  Mike   Assistant Prof  3      no
  Mary   Assistant Prof  7      yes
  Bill   Professor       2      yes
  Jim    Associate Prof  7      yes
  Dave   Assistant Prof  6      no
  Anne   Associate Prof  3      no

1-b. Develop or adopt a classification algorithm.
1-c. Use the training set to construct the model. The model may be represented as classification rules, decision trees, or mathematical formulae; here, the classifier is the rule

  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification: Two-Step Process (step 2 in detail)
• Model evaluation (accuracy): estimate the accuracy rate of the model on a test set.
  – The known label of each test sample is compared with the classified result from the model.
  – The accuracy rate is the percentage of test-set samples correctly classified by the model.
  – The test set must be independent of the training set; otherwise over-fitting will occur.
• The model is then used to classify unseen objects:
  – give a class label to a new tuple;
  – predict the value of an attribute.

Step 2: Use the Model in Prediction
2-a. Extract a set of test data from the database (independent of the training set, otherwise over-fitting will occur).
2-b. Use the classifier model to classify the test data. Applying the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes':

  Testing data (known labels) and classified result
  NAME     RANK            YEARS  TENURED (known)  TENURED (predicted)
  Tom      Assistant Prof  2      no               no
  Merlisa  Associate Prof  7      no               yes
  George   Professor       5      yes              yes
  Joseph   Assistant Prof  7      yes              yes

2-c. Compare the known label of each test sample with the classified result from the model.
2-d. Estimate the accuracy rate of the model: the percentage of test-set samples that are correctly classified by the model.
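As an illustration of steps 2-c and 2-d, the following minimal Python sketch applies the slide's rule to the toy test set above and computes the accuracy rate; the classify helper is a hypothetical name introduced only for this example.

```python
# Steps 2-c/2-d on the toy test set from the slides.
test_data = [
    # (name, rank, years, known_label)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(1 for _, rank, years, label in test_data
              if classify(rank, years) == label)
accuracy = correct / len(test_data)
print(f"Accuracy rate: {accuracy:.0%}")   # 3 of 4 correct -> 75%
```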
2-e. Modify the model if need be (here, the rule is simplified to IF rank = 'professor' THEN tenured = 'yes').
2-f. Use the model to classify unseen objects: give a class label to a new tuple, or predict the value of an attribute.

  Unseen data and classified result
  NAME    RANK            YEARS  TENURED (known)  TENURED (predicted)
  Maria   Assistant Prof  5      ?                no
  Juan    Associate Prof  3      ?                no
  Pedro   Professor       4      ?                yes
  Joseph  Assistant Prof  8      ?                no

Classification Methods
• Decision tree induction
• Neural networks
• Bayesian classification
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches

Improving Accuracy: Composite Classifier
• Train several classifiers (classifier 1, 2, 3, ..., n) on the data and combine their votes when classifying new data.

Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency on disk-resident databases
• Interpretability: understanding and insight provided by the model

Outline (feature selection part)
• Introduction and Motivation
• Background and Related Work
• Preliminaries
  – Publications
  – Theoretical Framework
  – Empirical Framework: Margin-Based Instance Weighting
  – Empirical Study
• Planned Tasks

Introduction and Motivation: Feature Selection Applications
• Text categorization: a document-term matrix (documents D1, ..., DM; terms T1, ..., TN; class label C such as Sports or Travel), where the terms are the features.
• Other domains: samples described by features such as genes or proteins, or images described by pixels.

Introduction and Motivation: Feature Selection from High-Dimensional Data
• Pipeline: high-dimensional data → feature selection algorithm (mRMR, SVM-RFE, Relief-F, F-statistics, etc.) → low-dimensional data → learning models (classification, clustering, etc.) → knowledge discovery.
• With p the number of features and n the number of samples, high-dimensional data means p >> n.
• Curse of dimensionality:
  – effects on distance functions
  – in optimization and learning
  – in Bayesian statistics
• Feature selection helps by:
  – alleviating the effect of the curse of dimensionality;
  – enhancing generalization capability;
  – speeding up the learning process;
  – improving model interpretability.

Introduction and Motivation: Stability of Feature Selection
• Applying the same feature selection method to different training data drawn from the same problem can yield different feature subsets: are they consistent or not?
• Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.
• The analogous stability of learning algorithms was first examined by Turney in 1995; the stability of feature selection was relatively neglected before and has recently attracted interest from data mining researchers.
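One way to make the stability notion concrete is to run the same selector on several perturbed versions of the training set and compare the selected subsets. The sketch below does this in Python, assuming scikit-learn; the univariate F-statistic selector, bootstrap resampling, and average pairwise Jaccard similarity are illustrative choices, not the lecture's own setup.

```python
# A minimal sketch of measuring feature-selection stability (illustrative data).
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# High-dimensional data: many more features than samples (p >> n).
X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)

rng = np.random.default_rng(0)
subsets = []
for _ in range(5):
    # Perturb the training set by bootstrap resampling.
    idx = rng.choice(len(y), size=len(y), replace=True)
    selector = SelectKBest(f_classif, k=10).fit(X[idx], y[idx])
    subsets.append(set(selector.get_support(indices=True)))

# Stability score: average pairwise Jaccard similarity of the selected subsets.
jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("Mean Jaccard stability:", np.mean(jaccard))
```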
Credit-scoring data set: variable descriptions (part 2)

  Variable name  Description                     Codings
  dainc          Applicant's income
  res            Residential status              O = owner, F = tenant furnished,
                                                 U = tenant unfurnished, P = with parents,
                                                 N = other, Z = no response
  dhval          Value of home                   0 = no response or not owner,
                                                 000001 = zero value, blank = no response
  dmort          Mortgage balance outstanding    0 = no response or not owner,
                                                 000001 = zero balance, blank = no response
  doutm          Outgoings on mortgage or rent
  doutl          Outgoings on loans
  douthp         Outgoings on hire purchase
  doutcc         Outgoings on credit cards
  bad            Good/bad indicator              1 = bad, 0 = good

The Feature Selection Process
• Search strategies: complete search, sequential search, random search.
• Evaluation criteria: filter model, wrapper model, embedded model.
• Representative algorithms: Relief, SFS, MDLM, etc. (filter); FSBC, ELSA, LVW, etc. (wrapper); BBHFS, Dash-Liu's, etc. (embedded).

Evaluation Strategies
• Filter methods
  – Evaluation is independent of the classification algorithm.
  – The objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence, or information-theoretic measures.
• Wrapper methods
  – Evaluation uses criteria related to the classification algorithm.
  – The objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data) via statistical resampling or cross-validation.

Naïve Search
• Sort the given n features in order of their individual probability of correct recognition.
• Select the top d features from this sorted list.
• Disadvantages:
  – Feature correlation is not considered.
  – The best pair of features may not even contain the best individual feature.

Sequential Forward Selection (SFS) (heuristic search)
• First, the best single feature is selected (using some criterion function).
• Then, pairs are formed from this best feature and each remaining feature, and the best pair is selected.
• Next, triplets are formed from these two best features and each remaining feature, and the best triplet is selected.
• This procedure continues until a predefined number of features have been selected.
• SFS performs best when the optimal subset is small (a code sketch of this greedy procedure appears after the SBS slide below).
• Example (figure): sequential forward feature selection for classification of a satellite image using 28 features; the x-axis shows classification accuracy (%), the y-axis shows the feature added at each iteration (first iteration at the bottom), and the highest accuracy is marked with a star.

Sequential Backward Selection (SBS) (heuristic search)
• First, the criterion function is computed for all n features.
• Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded.
• Next, each of the remaining n-1 features is deleted one at a time, and again the worst feature is discarded, leaving a subset with n-2 features.
• This procedure continues until a predefined number of features are left.
• SBS performs best when the optimal subset is large.
• Example (figure): sequential backward feature selection for classification of a satellite image using 28 features; the x-axis shows classification accuracy (%), the y-axis shows the feature removed at each iteration (first iteration at the top), and the highest accuracy is marked with a star.
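A minimal Python sketch of the SFS wrapper idea follows (SBS is the mirror image, repeatedly deleting the worst feature instead of adding the best one). It assumes scikit-learn and uses cross-validated k-NN accuracy as the criterion function; both are illustrative choices rather than the lecture's specific setup.

```python
# A minimal sketch of sequential forward selection with a wrapper criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

def criterion(features):
    # Wrapper evaluation: mean cross-validated accuracy on the candidate subset.
    return cross_val_score(KNeighborsClassifier(), X[:, features], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):  # continue until a predefined number of features are selected
    best = max(remaining, key=lambda f: criterion(selected + [f]))
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, accuracy = {criterion(selected):.3f}")
```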
Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
  – SFS is performed from the empty set.
  – SBS is performed from the full set.
• To guarantee that SFS and SBS converge to the same solution:
  – features already selected by SFS are not removed by SBS;
  – features already removed by SBS are not selected by SFS.

"Plus-L, minus-R" Selection (LRS)
• A generalization of SFS and SBS:
  – If L > R, LRS starts from the empty set and repeatedly adds L features, then repeatedly removes R features.
  – If L < R, LRS starts from the full set and repeatedly removes R features, then repeatedly adds L features.
• LRS attempts to compensate for the weaknesses of SFS and SBS through some backtracking capability.

Sequential Floating Selection (SFFS and SFBS)
• An extension of LRS with flexible backtracking capability:
  – Rather than fixing the values of L and R, floating methods determine them from the data.
  – The dimensionality of the subset can be thought of as "floating" up and down during the search.
• There are two floating methods:
  – Sequential floating forward selection (SFFS) starts from the empty set; after each forward step, SFFS performs backward steps as long as the objective function increases.
  – Sequential floating backward selection (SFBS) starts from the full set; after each backward step, SFBS performs forward steps as long as the objective function increases.
• P. Pudil, J. Novovicova, J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters 15 (1994) 1119-1125.

Feature Selection using Genetic Algorithms (GAs) (randomized search)
• GAs provide a simple, general, and powerful framework for feature selection.
• Pipeline: pre-processing → feature extraction → feature selection (GA) → classifier.
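A minimal sketch of GA-based feature selection is shown below, assuming scikit-learn; the binary-mask encoding, tournament selection, one-point crossover, bit-flip mutation, and decision-tree cross-validation accuracy as the fitness function are illustrative choices rather than a specific published configuration.

```python
# A minimal sketch of genetic-algorithm feature selection (illustrative setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
n_features = X.shape[1]

def fitness(mask):
    # Wrapper fitness: CV accuracy of a classifier trained on the masked features.
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

# Each chromosome is a binary mask over the features.
pop = rng.integers(0, 2, size=(20, n_features)).astype(bool)
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    # Tournament selection of parents.
    parents = pop[[max(rng.choice(len(pop), 3, replace=False), key=lambda i: scores[i])
                   for _ in range(len(pop))]]
    # One-point crossover between consecutive parent pairs.
    children = parents.copy()
    for i in range(0, len(children) - 1, 2):
        cut = rng.integers(1, n_features)
        children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
    # Bit-flip mutation with a small probability per gene.
    children ^= rng.random(children.shape) < 0.02
    pop = children

best = max(pop, key=fitness)
print("Selected features:", np.flatnonzero(best))
```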