PAKDD Tournament DATA MINING REPORT

Submitted by: PARTICIPANT 93
Group members: Karthik Vikram Yemmanur, Shehzad Hamza, Ravikiran Garimella, Sriram Munnangi and Ajay Cherukuri

INTRODUCTION

The PAKDD tournament provided us with a finance company's customer data and asked us to find a better solution to the company's existing cross-selling business problem. We followed the CRISP-DM process in creating our predictive model for this problem. The six major phases of the CRISP-DM process are:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

This report describes how we organized our model-building work around the CRISP-DM methodology.

BUSINESS UNDERSTANDING

The first step in creating a useful predictive model is to understand the business context of the problem being solved. The data was provided by a consumer-finance company that wants a better solution to a cross-selling business problem. The company currently has a base of credit-card customers as well as a base of home-loan (mortgage) customers. The overlap between these two services is very small, and we are expected to find a solution that effectively targets the credit-card customers who are most likely to request a home loan.

DATA UNDERSTANDING

The modeling dataset provided has 40,700 observations, of which 40,000 customers have a "No" response (TARGET_FLAG = 0) and 700 customers have a "Yes" response (TARGET_FLAG = 1). The TARGET_FLAG variable indicates that a customer opened a home-loan account within a year of opening a credit-card account. There are 42 variables in the dataset, including TARGET_FLAG. The other variables include demographic variables such as Annual Income Range, Monthly Disposable Income, Occupation Code, Number of Dependents and the District the customer applied from. There are also business-specific variables such as Customer Age, Checking Account status, the type of credit card held, how long the customer has lived at their current residence, how long they have held their current employment, and several variables that capture the number of Credit Bureau enquiries in the last 6 and 12 months. Detailed variable descriptions are provided with the modeling dataset, and most of these variables appear relevant to a consumer-finance business.

After understanding what the data meant, we assigned a measurement level to each variable based on its description. The variable roles are shown below in Figure 1.

Figure 1: Variable measurement levels and roles

DATA PREPARATION

Our first step in data preparation was to determine which variables could be used for modeling and which could be discarded. After exploring the data, we rejected the following variables:

DISP_INC_CODE – rejected due to excessive missing values
B_DEF_UNPD_L12M – rejected because it contains only one value

On further exploration of the dataset we also found a perfect correlation between the following variables: B_DEF_PAID_L12M, B_DEF_UNPD_IND, B_DEF_UNPD_L12M and B_DEF_PAID_IND. Since these variables all captured the same information, we kept B_DEF_PAID_IND for our model and rejected the other three.

The tournament dataset is highly skewed, with a response rate of just 1.72%.
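For illustration, these screening checks can be reproduced with a few lines of pandas, independently of the tool actually used for the modeling. The sketch below assumes the modeling data has been exported to a CSV file; the file name, the 50% missing-value cutoff and the exact column spellings are our own assumptions rather than part of the tournament materials.

```python
import pandas as pd

# Hypothetical export of the modeling dataset; the file name is an assumption.
df = pd.read_csv("pakdd_modeling_data.csv")

# Overall response rate: roughly 700 / 40,700, i.e. about 1.72%.
print(f"Response rate: {df['TARGET_FLAG'].mean():.2%}")

# Candidates for rejection: mostly-missing or single-valued columns
# (the 50% missing-value cutoff is our own choice, not from the tournament).
missing_share = df.isna().mean()
print("Excessive missing values:", list(missing_share[missing_share > 0.5].index))
n_levels = df.nunique(dropna=True)
print("Single-valued columns:", list(n_levels[n_levels <= 1].index))

# Redundancy check on the four default indicators: encode each as integer
# category codes and inspect their pairwise correlations.
defaults = ["B_DEF_PAID_L12M", "B_DEF_UNPD_IND", "B_DEF_UNPD_L12M", "B_DEF_PAID_IND"]
codes = df[defaults].apply(lambda s: s.astype("category").cat.codes)
print(codes.corr())
```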
The first step in creating a predictive model for a dataset with such a sparse distribution of positive responses is to balance the data. We created a balanced sample that included all 700 "Yes" responses (TARGET_FLAG = 1) and a random sample of 700 "No" responses (TARGET_FLAG = 0). The sample we built our models on therefore had 1,400 observations, half with a "Yes" response and half with a "No" response. We also entered the prior probabilities of the target distribution before building our predictive models, so that the models' predicted probabilities reflect the true 1.72% response rate rather than the 50/50 balanced sample.

One of the key issues was how the Bureau Enquiries variables were coded: values up to 97 are valid counts, while 98 and 99 are special values. We found that the 'Z' response of the B_DEF_PAID_IND variable already captured the records in which the Bureau Enquiries variables had a value of 98. There were a negligible number of 99s in the dataset and none of them had a "Yes" response, so we chose to ignore them. We therefore filtered the Bureau Enquiries variables to values up to 97. We also filtered extreme values of CURR_EMPL_MONTHS, PREV_EMPL_MONTHS and PREV_RES_MONTHS, and removed the records in the Diners Club card membership categories that had fewer than five observations. In addition, we recoded ANNUAL_INCOME_RANGE from a nominal variable (income-range buckets) into an ordinal variable with levels 1–5.

Finally, we examined the missing values in the dataset and found that only the class (nominal) variables had missing values. Since regression and neural network models do not handle missing values well, we used tree imputation to replace the missing values in the class variables.

MODELING

We built several models on our sample dataset to see which performed best. Using variable selection, and transforming variables where necessary, we built the following models:

1. Logistic Regression (stepwise selection, α = 0.05)
2. Polynomial Regression (stepwise selection, α = 0.05)
3. Gini Decision Tree
4. Entropy Decision Tree
5. Neural Networks (3, 4, 6 and 8 hidden layers)
6. Ensemble models combining various models

Our main criterion for selecting a model was the ROC index, followed by the misclassification rate. The logistic regression model had a high misclassification rate, so we chose not to include it in our final model. We built an ensemble model that combined the Gini tree, the Entropy tree, the neural network and the polynomial regression model. We used variable selection for the neural network and created several additional ensemble models, grouping member models according to the similarity of their misclassification rates.

EVALUATION

To evaluate the models, we created a random sample from the main dataset and scored it with each model. Since we had the actual values of the target variable for this sample, we used them to calculate the ROC index for each model. The model with the highest ROC index on our scoring dataset was the ensemble model. Using an ensemble increased model complexity, but it also markedly increased both the number of customers correctly predicted and the ROC index.

DEPLOYMENT

For this final phase, we scored the prediction dataset given to us and produced an Excel file with the predicted probability that each customer would take out a home loan with the company, arranged by customer ID.
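The evaluation and deployment steps can be illustrated with a short, self-contained sketch. A single scikit-learn gradient-boosting classifier stands in here for the tree, neural-network and regression ensemble we actually built, and the file and column names are hypothetical; the sketch also assumes the inputs have already been imputed and numerically encoded as described under Data Preparation.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical file names; both files are assumed to contain only numeric,
# already-imputed columns plus CUSTOMER_ID (and TARGET_FLAG for the sample).
train = pd.read_csv("balanced_sample.csv")       # the 1,400-row balanced sample
score = pd.read_csv("prediction_dataset.csv")    # the tournament prediction data

features = [c for c in train.columns if c not in ("TARGET_FLAG", "CUSTOMER_ID")]

# Hold out part of the sample to mimic the random scoring sample used for evaluation.
X_train, X_valid, y_train, y_valid = train_test_split(
    train[features], train["TARGET_FLAG"],
    test_size=0.3, stratify=train["TARGET_FLAG"], random_state=42)

# A single gradient-boosting classifier stands in for the actual ensemble.
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# ROC index (area under the ROC curve) on the held-out observations.
print("ROC index:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))

# Score the prediction dataset and write the probabilities, by customer ID, to Excel.
score["P_TARGET_FLAG"] = model.predict_proba(score[features])[:, 1]
(score[["CUSTOMER_ID", "P_TARGET_FLAG"]]
    .sort_values("CUSTOMER_ID")
    .to_excel("participant_93_predictions.xlsx", index=False))
```

Because such a model is trained on a 50/50 balanced sample, its raw probabilities overstate the true 1.72% response rate; specifying the prior probabilities of the target distribution, as noted under Data Preparation, is what adjusts the final scores back to that prior.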
This phase will only be complete once we can validate our results against the withheld target values.
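If the withheld target values are eventually released with matching customer IDs, that validation could be a simple check along the lines of the sketch below; the file names and column names are again assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical files: our submitted scores and the released true targets, keyed by customer ID.
submitted = pd.read_excel("participant_93_predictions.xlsx")
actuals = pd.read_csv("withheld_targets.csv")   # columns: CUSTOMER_ID, TARGET_FLAG

merged = submitted.merge(actuals, on="CUSTOMER_ID", how="inner")
print("ROC index on the withheld data:",
      roc_auc_score(merged["TARGET_FLAG"], merged["P_TARGET_FLAG"]))
```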