PAKDD 2006 Data Mining Competition

Submitted By: Tulsa Group #4
Abhay Barapatre
Soumitra Rayarikar
Abhijit Sadhu

01/03/2006
Stillwater, OK

Executive Summary

Following the successful launch of a third generation (3G) telecommunications network, an Asian telco operator wants to identify the customers most likely to switch to its 3G network. The data set contains information about various attributes of existing customers such as age, sex, subscription plan, model of mobile handset used, etc. To help the telecom company identify these probable customers, we built predictive models using several modeling techniques: Logistic Regression (LR), Decision Tree (DT) analysis, Artificial Neural Networks (ANN), and combinations of these (refer to the report for details on each technique). The models were built using variables from the data set describing existing customer usage and demography.

Before building the models we studied the data and found a number of redundant and correlated variables; we eliminated all such variables before beginning the model building process. However, a few variables appeared especially important for predicting whether a customer will adopt the 3G network. Accessing and using 3G services requires high-end mobile phones with the latest technological features such as high data transfer rates, multimedia, and software platforms like Java. Based on this knowledge we assumed that customers shifting from 2G to 3G are very likely to use high-end handset models that are relatively new in the market, and that such attributes are important factors in deciding whether a customer will use 3G services. Thus, a few variables such as handset age and handset model were forced into all the models, irrespective of whether they were statistically selected by the models.
After choosing the important variables and rejecting the redundant ones, we used a total of 18,000 customer records to build and test the models. We started with simple models such as Logistic Regression and various Decision Trees to predict the customers likely to shift to 3G. Since these models did not perform satisfactorily when tested, we moved on to more complex approaches. In these we built different types of Artificial Neural Networks based on DT or Variable Selection preprocessing: after choosing the variables manually, we fed them to a DT or a Variable Selection node to trim the number of variables further before building the ANN model. This helped us keep only the variables that genuinely affected the probability that a customer makes the shift. Finally, we combined the ANN models into an Ensemble model. The details of the best performing models are shown in the table below.

Table 1: Comparison of Model Performances (all values in %, Training / Validation)

Model            Misclassification   Sensitivity     Specificity     Accuracy
LR model         27.30 / 26.85       59.86 / 62.38   85.66 / 83.64   72.70 / 73.15
Gini 3w          15.59 / 19.56       75.51 / 73.31   93.30 / 87.86   84.41 / 80.44
ANN DT 3w gini   17.51 / 19.55       83.24 / 81.52   81.72 / 80.53   82.49 / 80.45
ANN VS Chi sq    18.48 / 19.80       78.84 / 79.48   84.22 / 83.18   81.52 / 80.20
Ensemble         17.00 / 19.00       80.90 / 79.32   86.05 / 82.73   83.00 / 81.00

From the table we see that the Ensemble model outperforms the other models: it has a relatively low misclassification rate together with reasonably high sensitivity and specificity values.
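The measures reported in Table 1 follow directly from the confusion matrix of a classifier. As an illustration (not the SAS output itself; the counts below are hypothetical), a minimal sketch of how misclassification rate, sensitivity, specificity, and accuracy are derived:

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the four measures reported in Table 1 (as percentages).

    tp/fn: 3G customers predicted correctly/incorrectly (positives)
    tn/fp: 2G customers predicted correctly/incorrectly (negatives)
    """
    total = tp + fn + tn + fp
    return {
        "misclassification": 100.0 * (fp + fn) / total,
        "sensitivity":       100.0 * tp / (tp + fn),   # true positive rate
        "specificity":       100.0 * tn / (tn + fp),   # true negative rate
        "accuracy":          100.0 * (tp + tn) / total,
    }

# Hypothetical counts for a balanced 200-record validation set
m = classification_metrics(tp=80, fn=20, tn=85, fp=15)
```

Sensitivity is the share of actual 3G customers the model catches, which is why it drives the targeting discussion that follows.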
Table 2: Relation between Sensitivity and Total 3G Predictions

Model            Sensitivity (%)   Correctly Predicted 3G   Actual 3G Customers (Given)   Total 3G Predictions Based on Model
ANN DT 3w gini   83.24             832                      1000                          1949
Ensemble         80.90             810                      1000                          1664

[Fig 1: Graph comparing sensitivity (83.24%, 78.84%, 80.90%) against total 3G predictions (1949, 1748, 1664) for the ANN DT 3w gini, ANN VS Chi sq, and Ensemble models, showing the trade-off between the sensitivity of a model and the total number of 3G customers predicted.]

From the graph we notice a trade-off between the total number of predicted 3G customers and the number of 3G customers correctly predicted: as sensitivity increases, the total number of 3G predictions also increases. This means that in order to predict a larger number of probable customers correctly, we would have to target a larger customer base, which in turn leads to higher costs. Thus, even though the sensitivity of the ANN model is higher than that of the Ensemble model, we select the latter. As seen in Figure 1 and Table 2, the ANN Gini 3-way model would require targeting an extra 285 people in order to gain 22 more customers. This is not justified, because the cost of targeting the extra people might far exceed the profit obtained from the few extra customers gained. We therefore selected the Ensemble model, which offers a good balance between sensitivity and the total number of predicted 3G customers. The lift chart (shown in Figure 9) further validates our decision. The lift chart plots the percentage of people, in deciles, against the percentage of customers shifting to 3G network services.
For example, if we pick the top 50% of the people with the highest propensity based on the Ensemble model, the percentage of 3G customers captured is about 83%, whereas for the same 50% the response rate is 81% for the ANN using the Gini 3-way method.

Conclusion

From the Ensemble model that was built, it can be seen that the variables handset model and handset age are the most important. Thus, as stated earlier, the company should focus on customers with relatively new handsets that have the latest technologies. Apart from these two variables, several others have been selected, such as subscription plan used, average games utilization in the last six months, and number of days to contract expiry (refer to Figure 5 for the full list of variables). Management should focus on these variables in order to increase sales of their 3G network services.

Model Building: Approach and Understanding

Introduction

An Asian telco is planning to launch a third generation (3G) telecommunications network. It wants to target probable customers using a customer data set containing information about customers' usage and demography, with attributes of existing customers such as age, sex, subscription plan, model of mobile handset used, etc. To determine the probable customers we built predictive models based on several modeling techniques: Logistic Regression, Decision Tree analysis, Artificial Neural Networks, and Ensemble models. (For a brief note on these techniques and how they work, refer to Modeling Techniques.)

Data Exploration

An original sample dataset of 20,000 2G network customers and 4,000 3G network customers was provided, with 251 data fields (variables). The target variable is "Customer_Type" (2G/3G), a binary categorical variable taking the value 2G or 3G.
A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network compatible mobile phone. Three-quarters of the dataset (15K 2G, 3K 3G) has the target field available and is meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) is made available with the target field missing and is meant to be used for prediction.

Tools Used

SAS Enterprise Miner 4.3

Modeling Techniques Used

Logistic Regression
Decision Tree
Artificial Neural Networks
Ensemble

Details of Algorithm Used

Data Modification

After careful analysis of the data, we identified 150 variables that were either redundant or had no impact on the target variable. Most of the data fields carry average, total, and standard deviation values for the same parameter. We kept the average values of these parameters and manually removed the standard deviation and total values for parameters such as payment methods, average bill, number of calls, etc. This ensured that no dummy or correlated variables entered the model building process. Some nominal variables were coded as interval variables in the dataset provided; we changed them manually to nominal variables. This was done in the initial input node of the model.

Replacement of Missing Values

The Replacement node belongs to the Modify category of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining process. We used the Replacement node to replace missing values and to trim specified non-missing values in the data sets used for data mining. Many variables had a significantly high proportion of missing values; we imputed them using tree imputation for both class and interval variables.
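SAS tree imputation predicts each missing value from the other inputs with a decision tree. As a much-simplified stand-in that illustrates the idea of the Replacement node, the sketch below fills interval variables with the median and class variables with the mode (the record contents and variable names are hypothetical, not the competition data):

```python
from statistics import median, mode

def replace_missing(records, interval_vars, class_vars):
    """Fill missing (None) values: median for interval variables,
    mode for class variables. A simplified stand-in for the SAS
    Replacement node's tree imputation."""
    filled = [dict(r) for r in records]          # do not mutate the input
    for var in interval_vars:
        observed = [r[var] for r in records if r[var] is not None]
        fill = median(observed)
        for r in filled:
            if r[var] is None:
                r[var] = fill
    for var in class_vars:
        observed = [r[var] for r in records if r[var] is not None]
        fill = mode(observed)
        for r in filled:
            if r[var] is None:
                r[var] = fill
    return filled

# Hypothetical records with missing handset age and subscription plan
rows = [{"HS_AGE": 3.0, "SUBPLAN": "A"},
        {"HS_AGE": None, "SUBPLAN": "A"},
        {"HS_AGE": 9.0, "SUBPLAN": None}]
clean = replace_missing(rows, interval_vars=["HS_AGE"], class_vars=["SUBPLAN"])
```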
Sampling

The Sampling node belongs to the Sample category of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining process and is used to extract a sample of the input data source. Sampling is done for extremely large databases because it tremendously decreases model fitting time. As long as the sample is sufficiently representative, patterns that appear in the data as a whole will be traceable in the sample; sampling also closes the gap between huge data sets and human limitations. The original sample dataset contains 20,000 2G network customers and 4,000 3G network customers, so it is biased towards 2G customers. To remove this bias we used the Sampling node: we drew a dataset of 6,000 records with equal numbers of 2G and 3G customers, using a seed of 5837. This balanced dataset was passed on for model building.

Data Partition

After sampling, the data is usually partitioned before modeling. The Data Partition node partitions the input data into the following data sets: Train, used for preliminary model fitting, where the analyst attempts to find the best model weights; and Validation, used to assess the adequacy of the model in the Model Manager and in the Assessment node, as well as for model fine-tuning. We split the data into 65% training, 25% validation, and 10% testing.

Data Transformation

The Transform Variables node enables us to create new variables that are transformations of existing variables in the dataset. Transformations are useful when we want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality in variables.
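The balanced sampling and the 65/25/10 partition described above can be sketched in Python. This is an illustration of the procedure, not the SAS nodes themselves, and the record contents are placeholders:

```python
import random

def balanced_sample(records_2g, records_3g, size, seed):
    """Draw size/2 records from each class so the sample is balanced."""
    rng = random.Random(seed)
    half = size // 2
    return rng.sample(records_2g, half) + rng.sample(records_3g, half)

def partition(records, seed, train=0.65, valid=0.25):
    """Shuffle and split into 65% training, 25% validation, 10% testing."""
    rng = random.Random(seed)
    rows = list(records)
    rng.shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

# 20,000 2G and 4,000 3G customers reduced to a balanced set of 6,000,
# using the seed reported in the text (its reuse for the split is ours)
pop_2g = [("2G", i) for i in range(20000)]
pop_3g = [("3G", i) for i in range(4000)]
sample = balanced_sample(pop_2g, pop_3g, size=6000, seed=5837)
train, valid, test = partition(sample, seed=5837)
```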
Observing the distributions, we found that most of the variables were not normally distributed, so we mostly used log and bucket transformations for these variables. Some of the major variables transformed were HS_AGE, DAYS_TO_CONTRACT_EXPIRY, AVG_AMT_PAID, etc. (Refer to Appendix I for some of the transformations.)

Variable Selection

The Variable Selection node belongs to the Explore category of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining process. The given database has two hundred and fifty-one variables, out of which we identified around 100 potential model inputs (independent or explanatory variables) that can be used to predict the target (response variable). Using the Variable Selection node we reduced the number of inputs by rejecting input variables that are not related to the target. Although rejected variables are passed to subsequent nodes in the process flow diagram, they are not used as model inputs by successor modeling nodes. The Variable Selection node quickly identifies input variables that are useful for predicting the target variable(s) and assigns them input roles. We also selected around 30 variables and forced them into the model manually. The 42 final variables selected after this node included (sample):

HS_MODEL
HS_AGE
SUBPLAN
DAYS_TO_CONTRACT_EXPIRY
AVG_VAS_GAMES
AVG_CALL_OB
AVG_MINS_INTRAN
LINE_TENURE
...

Screenshot from SAS Enterprise Miner (Figure 2)

Logistic Regression

Since our target variable is a binary categorical variable, the first model we built was the Logistic Regression model. Logistic regression attempts to predict the probability that a categorical target will acquire the event of interest as a function of one or more independent inputs. We first built the LR model using the stepwise selection method and a significance level of 5%.
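Logistic regression models the event probability through the logistic (sigmoid) function. The minimal from-scratch sketch below fits a one-input model by gradient descent; it is illustrative only (SAS Enterprise Miner fits by maximum likelihood with stepwise selection, and the data here are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w0 + w1*x) by batch gradient descent."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w0 + w1 * x) - y     # prediction error
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1

# Hypothetical: probability of being a 3G customer vs. (scaled) handset age
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [1, 1, 1, 1, 0, 0, 0, 0]                  # newer handset -> 3G
w0, w1 = fit_logistic(xs, ys)
p_new = sigmoid(w0 + w1 * (-2.0))              # very new handset
p_old = sigmoid(w0 + w1 * 2.0)                 # old handset
```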
The output is shown below: output for LR (screenshot from SAS Enterprise Miner 4.3) (Figure 3)

As we can see, the sensitivity was very low. We tried different combinations in the logistic regression model, including the forward selection and backward elimination methods, but there was no improvement in the misclassification, sensitivity, or accuracy values. We also observed that the variable selection done by this model was poor. The results of the best LR model built (all values Training / Validation):

Model           Misclassification Rate   Sensitivity       Specificity
Best LR model   27.30% / 26.85%          59.86% / 62.38%   85.66% / 83.64%

We proceeded to build the Decision Tree models.

Decision Tree Model

An empirical tree represents a segmentation of the data that is created by applying a series of simple rules. Each rule assigns an observation to a segment based on the value of one input. One rule is applied after another, resulting in a hierarchy of segments within segments. After reviewing the LR models, we built Decision Tree models using various splitting criteria: Chi-square, Gini index, and Entropy. The following table summarizes the Decision Tree models (all values in %, Training / Validation):

Model        Misclassification   Sensitivity     Specificity     Accuracy
Chi-Sq 2w    22.33 / 24.56       77.32 / 76.96   78.00 / 73.90   77.67 / 75.44
Chi-Sq 3w    21.26 / 23.39       81.12 / 81.20   76.44 / 71.77   78.74 / 76.61
Entropy 2w   23.67 / 24.94       73.30 / 73.10   79.26 / 77.09   76.33 / 75.05
Entropy 3w   21.79 / 24.33       74.45 / 73.53   81.84 / 77.89   78.21 / 75.67
Gini 2w      23.56 / 24.89       73.87 / 73.20   78.91 / 77.09   76.43 / 75.05
Gini 3w      21.59 / 23.67       79.87 / 79.84   76.99 / 72.67   78.41 / 76.33

There is little difference between the misclassification rate, sensitivity, and specificity values of the training and validation data; hence we can say that the model is stable and behaves the same way for both.
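The Gini index used as a splitting criterion measures the impurity of a node; at each step the tree picks the split that most reduces the weighted impurity of the child nodes. An illustrative sketch with hypothetical counts (not the SAS implementation):

```python
def gini(counts):
    """Gini impurity of a node given class counts, e.g. [n_2g, n_3g]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(children):
    """Weighted Gini impurity of a candidate split.

    children: list of per-child class counts, e.g. [[45, 5], [5, 45]]
    for a two-way split; three entries would model a three-way split.
    """
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Hypothetical node of 50 2G / 50 3G customers, split on handset age:
parent = gini([50, 50])                       # 0.5, maximally impure
two_way = split_gini([[45, 5], [5, 45]])      # a nearly pure 2-way split
```

The lower the weighted impurity relative to the parent, the better the split; a three-way split is evaluated the same way, just with three child-count lists.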
The variables selected in the best Decision Tree model are shown in Figure 4.

The tree tells us that the people who have the 3G network have:
Handset Model 10828, age < 0.5 months, with good games utilization;
Handset Model 10828, age < 5.5 months, fewer than 522 days to contract expiry, and the outbound call facility;
Handset Model 10829 with all the internet-based features.

However, we felt we should try for better sensitivity and lower misclassification rates, so we decided to build Artificial Neural Network models.

Artificial Neural Network (ANN) Model

Artificial Neural Networks (ANN) store information as patterns and use those patterns to identify the customers to target for the new product in the present problem. We built many ANN models and selected the following two based on their better sensitivity values: the ANN model on the Gini three-way split, and the ANN model on the Variable Selection node.

Model            Misclassification (Training / Validation)   Sensitivity   Specificity   Seed Used
ANN DT 3w gini   17.51% / 19.55%                             83.24%        81.72%        52132
ANN VS Chi sq    18.48% / 19.80%                             78.84%        84.22%        95509

There is little difference between the misclassification rate, sensitivity, and specificity values of the training and validation data; hence we can say that the model is stable and behaves the same way for both. The variables selected by the two models are:

ANN model on Gini three-way split: list of variables (screenshot from SAS Enterprise Miner) (Figure 5)

ANN model on Variable Selection node: list of variables (screenshot from SAS Enterprise Miner) (Figure 6)

Ensemble

The Ensemble node belongs to the Model category of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining process.
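An ANN of the kind built above (a multilayer perceptron) passes each input through a hidden layer of sigmoid units and combines their outputs into a posterior probability. A minimal forward-pass sketch with hypothetical, hand-set weights (the training of the weights, which SAS Enterprise Miner performs, is omitted):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One-hidden-layer perceptron: inputs -> sigmoid hidden units ->
    sigmoid output interpreted as P(customer is 3G)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Two inputs (e.g. scaled handset age, games utilization), two hidden
# units. Weights are illustrative, not fitted values.
hidden_w = [[-2.0, 1.0], [1.5, -0.5]]
hidden_b = [0.0, 0.0]
out_w, out_b = [3.0, -3.0], 0.0

p = mlp_forward([0.2, 0.8], hidden_w, hidden_b, out_w, out_b)
```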
The Ensemble node creates a new model by averaging the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple models that precede the Ensemble node in a process flow diagram. The new model created by the Ensemble node is then used to score new data. The two models we combined in the Ensemble are the ANN model on the Gini three-way split and the ANN model on the Variable Selection node.

Screenshot of the output from SAS Enterprise Miner (Figure 7)

Model      Misclassification (Training / Validation)   Sensitivity   Specificity
Ensemble   17% / 19%                                   80.90%        85.55%

There is little difference between the misclassification rate, sensitivity, and specificity values of the training and validation data; hence we can say that the model is stable and behaves the same way for both.

Screenshot of the output from SAS Enterprise Miner when scoring was done on the target field (Figure 8)

From the output we can say that, in the holdout data, our model will predict 1664 customers who have the 3G network, with 80.9% sensitivity.

Lift Charts (Figure 9)

This is a plot of the percentile of people, in deciles, versus the response rate of the data sample for 3G network customers. If we pick the top 50% of the people with the highest propensity based on the Ensemble model, the response rate is around 83%, whereas for the same 50% the response rate is around 80% for the ANN by the Variable Selection method and 81% for the ANN by the Gini three-way Decision Tree method.

Appendix I

[Before/after transformation plots for Days_to_Contract_Expiry, Avg_Bill_Amount, Avg_Bill_Voiced]
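The posterior averaging performed by the Ensemble node, and the subsequent scoring of new customers, can be sketched as follows (the posterior probabilities below are hypothetical, not the actual outputs of the two ANN models):

```python
def ensemble_posterior(score_lists):
    """Average the posterior probabilities from several models,
    as the Ensemble node does for a class target."""
    n_models = len(score_lists)
    return [sum(scores) / n_models for scores in zip(*score_lists)]

def classify(posteriors, cutoff=0.5):
    """Score new data: flag a customer as a probable 3G adopter
    when the averaged posterior exceeds the cutoff."""
    return ["3G" if p > cutoff else "2G" for p in posteriors]

# Hypothetical posteriors from the two ANN models for four customers
ann_gini = [0.92, 0.40, 0.55, 0.10]
ann_vs   = [0.88, 0.30, 0.65, 0.20]
avg = ensemble_posterior([ann_gini, ann_vs])
labels = classify(avg)
```

Averaging tends to smooth out the individual models' disagreements, which is consistent with the Ensemble's balanced sensitivity and specificity reported above.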