Pacific-Asia Conference on Knowledge Discovery and Data Mining
PAKDD 2006 Data Mining Competition
Prepared by: TUL

Executive Summary

The data mining task is a classification problem whose objective is to accurately predict as many current 3G customers as possible (i.e. true positives) from the "holdout" sample provided. The target variable is "Customer_Type" (2G/3G) (Appendix I), a categorical variable indicating whether a particular customer is a 2G or a 3G subscriber. A list of independent variables was provided containing customer information such as demographics, usage patterns and credit history, as well as statistical measures such as averages and standard deviations for a number of variables. Based on domain knowledge and the structural relationships shared between the variables, we pruned the total number of variables from the original 250 to 157, removing structurally related variables, unary variables and the like.

Given the large number of variables, the critical issue in building a stable model is variable selection, and managerial know-how is essential to prune the variables. The next step is data preparation, which consists of data partitioning and balancing, and missing value replacement. The selected variables are used as inputs for models built with several modeling techniques: Logistic Regression, Artificial Neural Network, Decision Tree and an Ensemble Model. The models built during our mining task were:

- Logistic Regression model with stepwise selection and forced variables
- 5 Neuron ANN model with Chi-square variable selection
- 5 Neuron ANN model with R2 variable selection
- 5 Neuron ANN with Gini reduction decision tree
- Ensemble model combining the above four models

Logistic Regression with stepwise selection and forced variables (Appendix II)

A stepwise selection was carried out to arrive at a list of statistically significant variables in the dataset. In addition, we forced certain variables into the model on the basis of domain knowledge and the business characteristics of 3G technology. The 39 variables selected as input for the LR model are depicted in Appendix II. Variable transformation was carried out to account for outliers and for failures of normality, linearity and homoscedasticity. Logistic regression was run on the target variable, and the comprehensive results of the model are shown in the comparison table (see Report).

Chi-square variable selection with 5 Neuron ANN (Appendix III)

A Variable Selection node was used to filter the input variables using the Chi-square criterion; predictive modeling was then done with an artificial neural network. The main consideration when building an artificial neural network is to locate a globally optimal solution, that is, a set of weights such that the network produces the least possible error when records are passed through it. Since the problem is quite complex, a large number of feasible solutions may exist. To reach the global optimum and avoid sub-optimal values, we used advanced features of the Neural Network node such as randomized scale estimates, randomized target weights and randomized target bias weights. We also tried changing the random seeds and the balancing sequence in SAS. The optimum solution, based on sensitivity values and misclassification rates, was obtained for the model with five neurons and a random seed of '1128' with equal-size balancing.
The network model used for prediction was the MLP (Multi-Layer Perceptron). The advantage of an MLP network is that it is effective on a wide range of problems and is capable of generalizing well.

R2 variable selection with 5 Neuron ANN (Appendix IV)

Variable selection was done using the R-square criterion. We used a squared-correlation cut-off of 0.005, meaning that all input variables whose squared correlation with the target falls below the cut-off are assigned a rejected role. We then used a stepwise R-square improvement cut-off of 0.0005, so that all input variables whose stepwise R-square improvement falls below this value are likewise rejected. The predictive model used was an Artificial Neural Network with 5 neurons.

Gini reduction decision tree with 5 Neuron ANN (Appendix V)

Variable selection was carried out using a Gini reduction Decision Tree. In this particular model we decided to use a binary tree. We tried all three purity measures, Gini, Entropy and the Chi-square test, for building decision trees; on careful evaluation we found that Gini reduction gave the best results. In pruning it is usually better to err on the side of complexity, which yields a bushier tree. The model used for prediction was an Artificial Neural Network: we used the network and its advanced features for prediction, with the output of the decision tree fed into it.

Ensemble model combining the above four models (Appendix VI)

The Ensemble node is used when one wishes to integrate the component models from two or more complementary modeling methods into a collective model solution. The Ensemble node can perform stratified modeling, bagging, boosting and combined modeling, any of which might yield the most accurate prediction. We performed combined modeling, which creates a new model by averaging the posterior probability of the target variable from multiple models.

After building the various models, the next logical step was to analyze their results and select the best classification model. A comparative evaluation of the models was carried out on important parameters such as Sensitivity, Misclassification Rate, Percentage Response and Lift Value. Sensitivity indicates the percentage of true positives captured by the model; Misclassification Rate indicates the percentage of records classified incorrectly. An optimum model is one with high sensitivity and the lowest misclassification rate. The sensitivity and misclassification rates of the models we built are tabulated (see Report). Various lift charts were studied to facilitate the selection of the best model (Appendix VII):

- The Cumulative Percentage Response curve arranges customers into deciles based on their predicted probability of response and plots the actual percentage of 3G customers in each.
- The Cumulative Percentage Captured Response curve answers the question of what percentage of the total number of 3G customers is present in a given decile, i.e. it demonstrates decile strength.
- The Non-Cumulative Lift Value indicates the relative strength of the models. The predictive ability of a model is taken into consideration up to those deciles whose lift value exceeds that of the baseline model. The non-cumulative values indicate the true percentage of customers in each decile separately.
While selecting the best model, a few trade-offs have to be considered. In our opinion, the error of omission is graver than the error of commission; the managerial intent, however, is to capture the highest number of true positives. A trade-off therefore has to be made to shortlist models that combine high sensitivity with a low misclassification rate. We then consider the percentage response, percentage captured response and lift values of the shortlisted models and choose the model that scores well on all three criteria. The model we selected was the 5 Neuron ANN model with Chi-square variable selection, as shown in Appendix VIII.

Conclusion: Having started with the objective of predicting 3G customers in the prospective database as accurately as possible, we carried out the mining objectives in SAS 9.1. We got a feel for the data, selected important variables based on domain knowledge and carried out data preparation. Different models, including Logistic Regression, Decision Trees and Artificial Neural Networks, were built and compared. The best model was then chosen and used to score the dataset. The predicted 3G adopters constitute only 1359 records of the total dataset but represent the most important customer segments, as shown in Appendix IX. This model, although complex to explain, improved the prediction rate over the sample dataset by approximately 8 percent.

Report

Problem definition: An Asian telco operator which has successfully launched a third-generation (3G) mobile telecommunications network would like to use existing customer usage and demographic data to identify which customers are likely to switch to its 3G network. A sample dataset of 20,000 2G network customers and 4,000 3G network customers was provided, with more than 200 data fields, as shown in Appendix I. The target categorical variable is "Customer_Type" (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G-network-compatible mobile phone. Three-quarters of the dataset (15K 2G, 3K 3G) has the target field available and is meant to be used for training and testing. The remaining portion (5K 2G, 1K 3G) is made available with the target field missing and is meant to be used for prediction.

Translating the Business Goal to a Data Mining Problem: The data mining task assigned to us is a classification problem whose objective is to accurately predict as many current 3G customers as possible (i.e. true positives) from the "holdout" sample provided.

Variable description: The target variable is "Customer_Type" (2G/3G), a categorical variable indicating whether a particular customer is a 2G or a 3G subscriber. A list of independent variables is provided along with their descriptions. These variables contain a great deal of customer information, such as demographics, usage patterns, value-added services subscribed to and credit history, and also cover statistical measures such as averages and standard deviations for a number of variables. Based on domain knowledge and the structural relationships shared between the variables, we pruned the total number of variables from the original figure of approximately 250 down to 157.

Approach Used: Given the large number of variables, the critical issue in building a stable model is variable selection.
Along with the tools provided in SAS, managerial know-how is essential to prune the variables. A key point is that variable selection does not mean simply reducing the number of variables, but selecting those variables that lead to a high prediction rate. Once variable selection is done, the selected variables are used as inputs for models built with different modeling techniques: Logistic Regression, Artificial Neural Network, Decision Tree and an Ensemble Model. After carefully studying the misclassification rate, sensitivity and scoring numbers of each model, a prudent choice of the best model is made.

Model Prospecting in SAS: Model prospecting is a complex process which requires adequate domain knowledge and sound data mining fundamentals. In general, model prospecting comprises the following.

Data Preparation:

Data Partition and Balancing: The distribution of the target variable in the dataset is highly skewed. To resolve this bias we balanced the data so as to obtain a fair sample, using a Sampling node. The Sampling node performs simple random sampling, nth-observation sampling, stratified sampling, first-n sampling, or cluster sampling of an input data set; in our model we performed equal-size stratified sampling with a random seed of 1128. Data partitioning then provides mutually exclusive data for training, validation and testing, which helps avoid over-fitting the data to a particular model; as a result, the models are assessed on data independent of the data used to build them. The dataset was divided into 70% training, 20% validation and 10% test, using a random seed of '1128'.

Missing Values Replacement: We used a Replacement node to replace the missing values in the dataset. As the name suggests, invalid data can be replaced with user-defined default values, or missing values can be imputed using a wide range of imputation methods; doing so helps prevent the blurring of the analysis. For numeric variables, the tree imputation method was used: tree imputation uses all available information (except that from the variable being imputed) as input to calculate the value of the imputed variable with a tree algorithm, which ensures that a maximum of information is used when replacing a missing value. For categorical variables we used the default constant 'U', as we wanted to group together all missing values for each categorical variable.
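These preparation steps were configured through Enterprise Miner nodes rather than code. As a concrete illustration only, the following Python sketch (pandas/scikit-learn) performs roughly equivalent balancing, imputation and partitioning; the file name is hypothetical and the median fill is a simple stand-in for tree imputation.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical input file; the competition data has roughly 250 fields.
    df = pd.read_csv("customer_data.csv")

    # Equal-size balancing: keep all 3G records and an equal-size random
    # sample of 2G records (the Sampling node's equal-size stratified option).
    n_3g = (df["Customer_Type"] == "3G").sum()
    balanced = (df.groupby("Customer_Type", group_keys=False)
                  .apply(lambda g: g.sample(n=n_3g, random_state=1128)))

    # Missing value replacement: median fill for numeric fields (a simple
    # stand-in for tree imputation) and the constant 'U' for categorical fields.
    num_cols = balanced.select_dtypes("number").columns
    cat_cols = balanced.select_dtypes(exclude="number").columns.drop("Customer_Type")
    balanced[num_cols] = balanced[num_cols].fillna(balanced[num_cols].median())
    balanced[cat_cols] = balanced[cat_cols].fillna("U")

    # 70% training, 20% validation, 10% test, stratified on the target.
    train, rest = train_test_split(balanced, train_size=0.70,
                                   stratify=balanced["Customer_Type"], random_state=1128)
    valid, test = train_test_split(rest, train_size=2 / 3,
                                   stratify=rest["Customer_Type"], random_state=1128)

The later sketches in this report reuse the train and valid partitions produced here and, for brevity, restrict themselves to numeric inputs.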
Building Different Models

1. Logistic Regression model with stepwise selection and forced variables

Variable Selection: Variable selection consisted of the following steps:

- manual inspection of all variables;
- discarding variables on the basis of structural relationships.

Reading literature pertinent to 3G technology gave us some domain knowledge, so we felt it appropriate to make certain managerial decisions, for instance forcing as inputs some variables that had not been selected by SAS initially. We performed Stepwise Regression (Logistic Regression), Decision Tree modeling and variable selection with a Variable Selection node; each of these produced a set of significant variables. We considered these variables and added a few which we construed as managerially significant but which had originally been left out by SAS. Our decision to add these variables was based on the domain knowledge gathered from our research on 3G technology; some important variables might have been rejected by stepwise selection, or by SAS on the basis of statistical significance in this particular dataset. The 39 variables selected as input for the LR model are listed in Appendix II.

Variable Transformation: The data at hand are not clean in the sense that they cannot be used directly as input to the LR model. Transforming the data accounts for outliers and for failures of normality, linearity and homoscedasticity; we attempt to obtain as normal a distribution as possible. In SAS we achieve this with the Transform Variables node and the 'maximize normality' option.

Logistic Regression Model: A logistic regression model was built because our dependent variable is binary. This provides theoretically permissible probabilistic values restricted to the two classes, while the independent (predictor) variables can take any form. The goal of the logistic regression is to correctly predict the outcome category for each case using the most parsimonious model; the output of an LR model expresses the probability of success over the probability of failure in the form of an odds ratio. We used the Regression node to run the logistic regression on the training dataset, with the input variables described in the variable selection process above. The highlight of the model was a sensitivity of 84.99% on the training data. The comprehensive results of the model are shown in the comparison table below.
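The stepwise search and the forcing of variables were configured in the Regression node. A rough Python analogue is sketched below; it uses scikit-learn's forward sequential selection (a cross-validated search rather than SAS's p-value-based stepwise procedure), works only on the numeric inputs from the partitions prepared earlier, and the forced variable names and the target of 20 selected inputs are illustrative assumptions rather than the report's actual settings.

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Numeric inputs and a 0/1 target from the prepared training partition.
    X_train = train.select_dtypes("number")
    y_train = (train["Customer_Type"] == "3G").astype(int)

    # Variables forced into the model on domain-knowledge grounds (illustrative names).
    FORCED = ["AVG_BILL_AMT", "LINE_TENURE"]
    candidates = [c for c in X_train.columns if c not in FORCED]

    logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    search = SequentialFeatureSelector(logit, n_features_to_select=20,
                                       direction="forward", cv=5, n_jobs=-1)
    search.fit(X_train[candidates], y_train)

    selected = FORCED + [c for c, keep in zip(candidates, search.get_support()) if keep]
    lr_model = logit.fit(X_train[selected], y_train)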
2. 5 Neuron ANN model with Chi-square variable selection

Variable Selection: Variable Selection node (Chi-square). As the number of input variables to a model increases, there is an exponential increase in the data required to densely populate the model space. In our case we had many input variables, some of which could be redundant or irrelevant. To determine which variables could be disregarded in modeling without losing important information, we used the Variable Selection node. In developing this model we used the node to filter the input variables with the Chi-square criterion, which is available only for binary target variables. Variable selection under the Chi-square criterion is performed by making binary variable splits that maximize the Chi-square value of a 2x2 table. The settings used with the Chi-square method were:

- Selection criterion: Chi-square
- Bins: 50
- Chi-square: 3.84
- Passes: 6
- Cutoff: 0.5

Each level of an ordinal or nominal input is decomposed into binary dummy variables, and interval inputs are binned into levels; the default of 50 bins was retained, and the other values listed above were also the defaults. A minimum Chi-square value of 3.84 is used to decide whether a split is worth considering: the higher the Chi-square threshold, the lower the number of splits. In our model, six passes were used to determine the optimum number of splits.

Predictive Modeling: Artificial Neural Network. After variable selection using the Chi-square criterion, predictive modeling was done with an artificial neural network, as shown in Appendix III. An artificial neural network is a network of many simple processors, each possibly having a small amount of local memory. The units are connected by communication channels that usually carry numeric (as opposed to symbolic) data encoded by various means. The units operate only on their local data and on the inputs they receive via the connections; the restriction to local operations is often relaxed during training. More specifically, neural networks are a class of flexible, nonlinear regression models, discriminant models and data reduction models interconnected in a nonlinear dynamic system. Neural networks are useful tools for interrogating ever larger volumes of data and for learning from examples to find patterns; by detecting complex nonlinear relationships in data, they can help make accurate predictions about real-world problems. In this model we used the Artificial Neural Network and its advanced features for prediction, with the output of the Variable Selection node fed into it.

Neural networks must 'learn' how to process input before they can be used in an application. Training a neural network involves adjusting the input weights on each neuron so that the output of the network is consistent with the desired output. This requires a training file consisting of data for each input node together with the correct, or desired, response for each of the network's output nodes. Once the network is trained, only the input data are provided to the network, which then 'recalls' the response it 'learned' during training. The goal of building a neural-network-based model is that it be able to predict the target variable "Customer_Type" from the selected inputs.

The network model used for prediction was the MLP (Multi-Layer Perceptron). In an MLP, each unit performs a biased weighted sum of its inputs and passes this activation level through a transfer function to produce its output, with the units arranged in a layered feed-forward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) as the free parameters. Such networks can model functions of almost arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity.

The main consideration when building an artificial neural network is to locate a globally optimal solution, that is, a set of weights such that the network produces the least possible error when records are passed through it. Since the problem is quite complex, a large number of feasible solutions may exist. To reach the global optimum and avoid sub-optimal values, we used the advanced features of the Neural Network node, such as randomized scale estimates, randomized target weights and randomized target bias weights; we also tried changing the random seeds and the balancing sequence in SAS. The optimum solution, based on sensitivity values and misclassification rates, was obtained for the model with five neurons and a random seed of '1128' with equal-size balancing. The model selection criterion we chose was 'Misclassification Rate', as no financial figures were provided that would allow 'profit/loss' to be used as the selection criterion.
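To make the pairing concrete, the sketch below combines a chi-square-based filter with a five-neuron single-hidden-layer perceptron, continuing the earlier Python examples. Note that scikit-learn's chi2 scores non-negative features directly against the target rather than searching binned binary splits, so it only approximates the Variable Selection node; keeping ten inputs mirrors the ten variables the selected model eventually retained.

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.metrics import recall_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler

    # Validation inputs built the same way as the training inputs above.
    X_valid = valid.select_dtypes("number")
    y_valid = (valid["Customer_Type"] == "3G").astype(int)

    # chi2 requires non-negative inputs, hence the [0, 1] scaling.
    chi_sq_ann = make_pipeline(
        MinMaxScaler(),
        SelectKBest(chi2, k=10),                  # keep the 10 highest-scoring inputs
        MLPClassifier(hidden_layer_sizes=(5,),    # five hidden neurons
                      max_iter=2000, random_state=1128),
    )
    chi_sq_ann.fit(X_train, y_train)
    print("validation sensitivity:", recall_score(y_valid, chi_sq_ann.predict(X_valid)))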
3. 5 Neuron ANN model with R2 variable selection

Variable Selection: Variable Selection node (R-square). In developing this model we used the Variable Selection node to filter the input variables with the R-square criterion. The R-square criterion computes the squared correlation between each input and the target variable and rejects the variables whose R-square is less than the cut-off; it then uses a stepwise correlation to evaluate the remaining input variables. The parameters used in the variable selection were:

- Selection criterion: R-square
- Squared correlation < 0.005
- Stepwise R2 improvement < 0.0005
- Ignore 2-way interactions
- Do not bin interval variables (AOV16)
- Use only grouped class variables
- Cutoff: 0.5

A squared-correlation cut-off of 0.005 means that all input variables whose squared correlation with the target is below the cut-off are assigned a rejected role. A stepwise R-square improvement cut-off of 0.0005 means that all input variables whose stepwise R-square improvement is below this value are likewise rejected. With the R-square selection criterion, the grouped class variables option lets the Variable Selection node reduce the number of levels of each class variable to a group variable based on its relationship with the target; the "use only grouped class variables" setting controls whether only the group variable, or both the group variable and the original class variables, are used, as shown in Appendix IV.

Predictive Model: Artificial Neural Network. To reach the global optimum and avoid sub-optimal values, we again used the advanced features of the Neural Network node, such as randomized scale estimates, randomized target weights and randomized target bias weights, and tried changing the random seeds and the balancing sequence in SAS. The optimum solution, based on sensitivity values and misclassification rates, was obtained for the model with five neurons and a random seed of '1128' with equal-size balancing. The network model used for prediction was the MLP (Multi-Layer Perceptron). The model selection criterion was again 'Misclassification Rate', as no financial figures were provided that would allow 'profit/loss' to be used as the selection criterion.
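The first stage of this filter is easy to reproduce directly: compute each input's squared correlation with the binary target and reject anything below 0.005. The sketch below, continuing the earlier examples, does exactly that for the numeric inputs; the stepwise-improvement pass and the grouped class variables are omitted for brevity.

    import numpy as np

    r2 = {}
    y = y_train.to_numpy()
    for col in X_train.columns:
        x = X_train[col].to_numpy(dtype=float)
        if x.std() == 0:                      # skip constant (unary) inputs
            continue
        r = np.corrcoef(x, y)[0, 1]
        r2[col] = r * r

    r2_inputs = [c for c, v in r2.items() if v >= 0.005]   # squared-correlation cut-off
    print(f"{len(r2_inputs)} of {len(r2)} inputs pass the 0.005 threshold")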
4. 5 Neuron ANN model with Gini reduction decision tree

Variable Selection: Decision Tree. A decision tree is so called because the predictive model can be represented in a tree-like structure. A decision tree is read from the top down, starting at the root node; each internal node represents a split on the values of one of the inputs, chosen to maximize the relationship with the target, so nodes become purer the further down the tree one goes. Here we use the decision tree to select only the variables important in growing the tree for further modeling. The variables considered important are scaled between 0 and 1, and variables with an importance factor below 0.05 are typically set to rejected in the nodes that follow the decision tree, as shown in Appendix V. In this particular model we decided to use a binary tree for variable selection. We tried all three purity measures, Gini, Entropy and the Chi-square test, for building decision trees; on careful evaluation we found that Gini reduction gave the best results. An important point is that in pruning it is usually better to err on the side of complexity, which yields a bushier tree; this also led us to select the Gini reduction model over the other purity measures. The Decision Tree node was configured with the following parameters:

- Splitting criterion: Gini reduction
- Minimum number of observations in a leaf: 5
- Observations required for a split search: 20
- Maximum number of branches from a node: 2
- Maximum depth of tree: 6
- Splitting rules saved in each node: 5
- Surrogate rules saved in each node: 0
- Treat missing as an acceptable value
- Model assessment measure: Average Square Error (Gini index)
- Sub tree: best assessment value
- Observations sufficient for split search: 3600
- Maximum tries in an exhaustive split search: 5000
- Do not use profit matrix during split search
- Do not use prior probability in split search

Predictive Model: Artificial Neural Network. To reach the global optimum and avoid sub-optimal values, we again used the advanced features of the Neural Network node, such as randomized scale estimates, randomized target weights and randomized target bias weights, and tried changing the random seeds and the balancing sequence in SAS. The optimum solution, based on sensitivity values and misclassification rates, was obtained for the model with five neurons and a random seed of '1128' with equal-size balancing. The network model used for prediction was the MLP (Multi-Layer Perceptron), and the model selection criterion was again 'Misclassification Rate', as no financial figures were provided that would allow 'profit/loss' to be used as the selection criterion. In this case the Artificial Neural Network and its advanced features were used for prediction with the output of the decision tree fed into it.
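A rough Python analogue of this variable-selection step, continuing the earlier sketches: fit a binary Gini tree of depth 6, rescale the variable importances so that the largest equals 1, keep inputs whose importance reaches 0.05, and feed them to the same five-neuron network. The settings mirror the node parameters above only loosely, and the 0.05 cut-off is applied to scikit-learn's impurity-based importances rather than Enterprise Miner's importance measure.

    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=6,
                                       min_samples_leaf=5, min_samples_split=20,
                                       random_state=1128)
    gini_tree.fit(X_train, y_train)

    # Scale importances to [0, 1] before applying the 0.05 cut-off.
    importance = gini_tree.feature_importances_ / gini_tree.feature_importances_.max()
    tree_inputs = [c for c, v in zip(X_train.columns, importance) if v >= 0.05]

    dt_ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                           random_state=1128).fit(X_train[tree_inputs], y_train)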
5. Ensemble Model combining the above four models

The Ensemble node is used when one wishes to integrate the component models from two or more complementary modeling methods into a collective model solution. The Ensemble node can perform stratified modeling, bagging, boosting and combined modeling, any of which might yield the most accurate prediction. We performed combined modeling, which creates a new model by averaging the posterior probability of the target variable from multiple models, as shown in Appendix VI. The ensemble model was created from the following models:

- Logistic Regression with stepwise selection and forced variables
- Gini reduction decision tree with 5 neuron ANN
- R2 variable selection with 5 neuron ANN
- Chi-square variable selection with 5 neuron ANN

Comparison

After building the various models, the next logical step was to analyze their results and select the best classification model. A comparative evaluation of the models was carried out on important parameters such as Sensitivity, Misclassification Rate, Percentage Response and Lift Value. Sensitivity indicates the percentage of true positives captured by the model; Misclassification Rate indicates the percentage of records classified incorrectly. An optimum model is one with high sensitivity and the lowest misclassification rate. The sensitivity and misclassification rates of the models we built are tabulated below. We further analyzed the models using lift charts, one of the simplest graphical tools for interpreting the predictive ability of a model. The following lift charts were studied, as shown in Appendix VII.

Cumulative Percentage Response Curve: this chart arranges customers into deciles based on their predicted probability of response and then plots the actual percentage of 3G customers in each.

Cumulative Percentage Captured Response Curve: this chart answers the question of what percentage of the total number of 3G customers is present in a given decile, i.e. it demonstrates decile strength.

Non-Cumulative Lift Value: this indicates the relative strength of the models. The predictive ability of a model is taken into consideration up to those deciles whose lift value exceeds that of the baseline model. The non-cumulative values indicate the true percentage of customers in each decile separately.
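These quantities are simple to compute from scored probabilities. The sketch below, continuing the earlier Python examples, derives sensitivity, misclassification rate and a non-cumulative decile lift table for any fitted classifier (here the chi-square ANN); it is an illustration of the measures, not the assessment actually produced by Enterprise Miner.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import confusion_matrix

    def evaluate(model, X, y, deciles=10):
        p = model.predict_proba(X)[:, 1]                  # posterior probability of 3G
        tn, fp, fn, tp = confusion_matrix(y, (p >= 0.5).astype(int)).ravel()
        sensitivity = tp / (tp + fn)
        misclassification = (fp + fn) / len(y)

        # Rank customers by predicted probability and assign deciles (1 = highest scores).
        scored = pd.DataFrame({"p": p, "y": np.asarray(y)}).sort_values("p", ascending=False)
        scored["decile"] = np.ceil((np.arange(len(scored)) + 1) * deciles / len(scored)).astype(int)
        lift = scored.groupby("decile")["y"].mean() / scored["y"].mean()   # non-cumulative lift
        return sensitivity, misclassification, lift

    sens, miscl, lift_table = evaluate(chi_sq_ann, X_valid, y_valid)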
Model selection: While selecting the best model, a few trade-offs have to be considered. In our opinion the error of omission is graver than the error of commission; the managerial intent, however, is to capture the highest number of true positives. A trade-off therefore has to be made to shortlist models that combine high sensitivity with a low misclassification rate. We then consider the percentage response, percentage captured response and lift values of the shortlisted models and choose the model that scores well on all three criteria. The model development was carried out in Enterprise Miner 4.2, as depicted in Appendix VIII. Based on sensitivity values and misclassification rates we shortlisted the following models:

- Logistic Regression model (Sensitivity: )
- 5 Neuron ANN model with Chi-square variable selection
- 5 Neuron ANN with DT variable selection

Of these three models, the 5 Neuron ANN model with Chi-square variable selection gave the best results on the lift charts and the ROC curve. We therefore decided to score the dataset using this model; the following section explains the selected model in detail.

Classification Model Explained:

Important variables in the classification model: the variables selected by the Variable Selection node are listed below; targeting should give higher preference to these variables.

- HS_MODEL
- HS_AGE
- SUBPLAN
- DAYS_TO_CONTRACT_EXPIRY
- LINE_TENURE
- AVG_BILL_AMT
- SUBPLAN_PREVIOUS
- TOT_USAGE_DAYS
- TOP1_INT_CD
- TOP2_INT_CD

Scoring: The Score node generates and manages predicted values of the target variable from a trained model; scoring formulae are created for both assessment and prediction in Enterprise Miner. Scoring the dataset with our model suggests that a total of 22.65% of customers from the prospective database would subscribe to the 3G service. This is a marked improvement on the 16.33% response rate in the training dataset.

Conclusion and Recommendations

Often a smaller, better-targeted campaign based on a decent model can turn out to be more profitable than a larger and more expensive one. A case in point is the fact that, although we started with approximately 250 variables, only 10 turned out to be significant. The deliverable of the exercise is not only to score the dataset; the larger purpose is to gain insight into the customers who would adopt 3G technology. Careful analysis of the data provided, and of the predictions made, will enable the marketers to target the more fruitful subset of the prospective list.

In conclusion, having started with the objective of predicting 3G customers in the prospective database as accurately as possible, we carried out the mining objectives in SAS 9.1. We got a feel for the data, selected important variables based on domain knowledge and carried out data preparation. Different models, including Logistic Regression, Decision Trees and Artificial Neural Networks, were built and compared. The best model was then chosen and used to score the dataset. The predicted 3G adopters constitute only 1359 records of the total dataset but represent the most important customer segments, as shown in Appendix IX. This model, although complex to explain, improved the prediction rate over the sample dataset by approximately 8 percent.

Appendices (figures referenced in the text):

Appendix I: Input data; distribution of the target variable Customer_Type
Appendix II: Logistic Regression; effect T-scores, row frequencies and parameter estimates for the logistic regression models
Appendix III: Variable selection using the Chi-square criterion with Artificial Neural Network; variables selected and average error for the network
Appendix IV: Variable selection using the R-square criterion with Artificial Neural Network; R-square values and effects for Customer_Type, fit statistics and average square error for the network
Appendix V: Decision Tree with Artificial Neural Network; decision tree ring and average square error, fit statistics and average error plot for the network
Appendix VI: Ensemble model; fit statistics
Appendix VII: Lift charts; cumulative % response, cumulative % captured response, lift value and ROC chart
Appendix VIII: Model development diagram
Appendix IX: After scoring (Chi-square variable selection with Artificial Neural Network); distribution of the target variable I_Customer_Type