Predictive Modeling & Scoring (PAKDD 2006 Data Mining Competition)

By ASIAN SQUARE (STL5)
Vinod Kanniah, Han-Mo Oh, Pimtong Tavitiyaman, Pawinee Boonyasopon
Oklahoma State University, Stillwater, Oklahoma 74075, United States of America
Due date: March 1, 2006

Table of Contents
Executive Summary
Business Understanding
Data Understanding
   Data Source
   Data Set
   Missing Values and Miscoding
Data Preparation
   Recoding and Data Type Adjustment
   Missing Values
   Data Adjustment
Modeling
   Data Set Attributes
   Sampling
   Data Partition
   Model Building Process
Evaluation
   Best Model
   Rules in Decision Tree
   Important Variables in the Final Model
   Scoring
Deployment
Conclusion
Appendix

Executive Summary

The sintrain dataset of 2G and 3G customers provided by PAKDD was used to predict customer type and the potential of 2G customers switching from the 2G to the 3G network. With the better roaming and data-transfer capabilities of the 3G network service, there is a good marketing opportunity in identifying the 2G customers who may shift to the next-generation 3G network. Our team built the prediction on the sintrain dataset provided by PAKDD, using SAS Enterprise Miner version 4.3. On initial exploration of the dataset we observed that many variables were miscoded and carried little meaningful information for the desired prediction. Hence, appropriate judgments were made to reject variables showing redundancy, skewness, high correlation, or improper data entry. The retained variables were replaced and transformed toward normal distributions.
Five different models were analyzed on three different datasets: the original dataset (217 input variables), the dataset with variables retained after judgments made by our team (141 input variables), and the dataset containing the team-chosen variables after replacements and transformations (141 input variables). A random seed of 2103 and a sample size of 6,000 corresponding to 50% 2G and 50% 3G customers were used. The dataset was split into training (60%), validation (30%), and testing (10%) sets with the same random seed. Each dataset was used to run the five prediction models, namely Logistic Regression; Decision Tree; Variable Selection (Chi-square/R-square) followed by Artificial Neural Network/Decision Tree/Logistic Regression; Decision Tree followed by ANN/Logistic Regression; and an Ensemble model. The three best models were evaluated on accuracy measures such as classification accuracy, sensitivity, and specificity. However, the differences in these measures among the models were insignificant. Thus, we compared the models using lift charts and concluded that the Decision Tree (C4.5 algorithm) as variable selection followed by an ANN (MLP) was the best model for predicting the 3G customers. The accuracy, sensitivity, and specificity of our final model were 80.56%, 74.19%, and 86.90% respectively. We observed that this model would accurately identify 57% of the 3G customers within the first 30% of the cumulative customers in the database, slightly higher than the other two models. Also, the profit was highest for this model over the first 20% of the cumulative customers. The handset model numbers, average GPRS data utilization (Kb) in the last 6 months, the age of the handset in months, average games utilization in the last 6 months, and the days to contract expiry were the major variables for predicting the 3G customers.
The manager may consider focusing on these particular customer market segments to improve the buying potential of the customers for 3G network services. The model and the results are explained in detail in this report. The suggestions made above are based on the business objective, data analysis, and testing of different models. The manager should update this model after every campaign and make the necessary changes for future predictions. If the manager believes that more variables should be included or deleted based on his or her domain expertise, then he or she may do so in the model. We strongly recommend that the company manage its data properly in the future and update the models based on the data available at that time. Also, the results varied with different random seeds; we therefore suggest that the manager re-examine the seed with the updated data when applying the model to the population.

Our team followed the CRISP-DM steps to complete this project. The six major steps involved are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

Diagram 1: CRISP-DM steps
Source: http://www.corporateintellect.com/services/crisp_dm02.htm

1. Business Understanding

The objective of this project was to use the customer usage and demographic data to accurately identify the 3G customers in the scoring dataset based on the training dataset provided by PAKDD, and to predict the potential customers who may switch from the 2G to the 3G network. 2G wireless systems have three system standards (CDMA, GSM, and TDMA) and transfer data at only 19.2 Kbps. On the other hand, 3G can transfer data at up to 2 Mbps, support large multimedia files and applications, and increase roaming capabilities (Graven, 2006). Thus, the company can use the results of our report to focus on the right target market and make 3G promotion offers in its marketing efforts. Moreover, it will help the company compete in a highly competitive market.
The business objectives of this study are to:
1. Build a prediction model that accurately predicts the buying intention of 3G customers based on the training dataset.
2. Employ the prediction model on the scoring dataset to identify the 3G customers and to investigate the possibility of existing 2G customers changing to the 3G network service.
3. Examine the behavioral characteristics (variables) of existing customers to improve marketing efforts on prospective customers.

2. Data Understanding

Our team used SAS Enterprise Miner version 4.3 for data exploration, analysis, and modeling.

Data Source
We used the dataset provided by PAKDD. The training dataset had a target variable (CUSTOMER_TYPE) with which to build a prediction model and hence identify the target variable in the scoring dataset. The sintrain dataset contained information on existing 2G and 3G customers, such as customers' demographic profiles, promotions, usage behaviors, and network usage (2G and 3G).

Data Set
There were two datasets given. We used the sintrain dataset (15K 2G customers, 3K 3G customers) for training, validation, and testing, and the sinscore dataset (5K 2G customers, 1K 3G customers) for scoring. Hence there were 18,000 records for training and 6,000 for scoring. The sintrain dataset had 251 variables. We observed that there were 38 categorical and 213 numeric variables; SAS identified these as 79 class and 172 interval variables. The target variable CUSTOMER_TYPE was coded as either 2G or 3G. The percentages of 2G and 3G customers in the sintrain dataset were 83.33% and 16.67% respectively. To better understand the data, we looked at the distributions of all the variables and analyzed some of them with statistical tests for correlation, redundancy, and significance based on their means.

Missing Values and Miscoding
After exploring the data, we found that the dataset had missing values in 10 variables.
Table 1: Variables with Missing Values

   Name of Variable           Data Type   % Missing
   DAYS_TO_CONTRACT_EXPIRY    Interval    6%
   AGE                        Interval    3%
   TOT_DEBIT_SHARE            Interval    2%
   OCCUP_CD                   Class       63%
   MARITAL STATUS             Class       6%
   CONTRACT_FLAG              Class       6%
   PAY_METHOD                 Class       5%
   PAY_METHOD_PREV            Class       5%
   HS_MANUFACTURER            Class       3%
   HS_MODEL                   Class       3%

Some variables, however, were not documented properly. For example, the descriptions of AVG_CALL_MOB and AVG_MINS_MOB differed from their variable names. Furthermore, the variables STD_VAS_QG, STD_VAS_QI, STD_VAS_QP, STD_VAS_QTXT, and STD_VAS_QTUNE were misspelled as STD_VAS_FG, STD_VAS_FI, STD_VAS_FP, STD_VAS_FTXT, and STD_VAS_FTUNE in their descriptions. Also, 30 variables were rejected by default (in SAS EM version 4.3) since they were extremely skewed (unary: 24 variables, ordinal: 3 variables, binary: 3 variables). More details on missing values, miscoding, redundancy, and so on are elaborated in the data preparation step.

3. Data Preparation

Data preparation is a key step in CRISP-DM toward finding the best model. After exploring the data we found that we had to handle outliers, redundant variables, and miscoding; bin numeric variables; group classes of categorical variables; and treat variables with missing values, applying appropriate transformations.

Recoding & Data Type Adjustment
On analyzing the sintrain dataset we found that the data types did not match those in the variable description table, so some data type adjustments were made. In the training dataset, we observed that 36 categorical variables were miscoded, and we recoded them as binary (20) and nominal (16) variables.

Missing Values
Missing values seriously affect data mining results, as they may produce invalid and unreasonable conclusions. Hence, we looked at each variable's distribution and decided either to replace the missing data or to reject the variable. For example, the nominal variable OCCUP_CD had 63% missing data.
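A missing-value audit of this kind can be sketched with pandas. This is only an illustration on a toy stand-in dataset (the actual work was done in SAS EM); the column names are taken from Table 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sintrain dataset; real column names from Table 1,
# but the values and missingness pattern here are invented.
df = pd.DataFrame({
    "OCCUP_CD": [np.nan, np.nan, "A", np.nan, "B", np.nan],
    "AGE": [34, np.nan, 41, 29, 55, 38],
    "HS_MODEL": ["10214", "10328", None, "10311", "10326", "10324"],
})

# Percentage of missing values per variable, worst first
pct_missing = df.isna().mean().mul(100).round(1)
report = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(report)
```

On the real dataset, the same three lines at the end would reproduce the percentages listed in Table 1, with OCCUP_CD at the top.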
Even though we applied a replacement technique to this variable, it still did not survive into the final prediction model. As a result, we rejected OCCUP_CD from the input data source.

Data Adjustment
In the beginning there were 249 variables in the sintrain dataset. However, we rejected many variables based on the factors described below; the rejected variables are tabulated in appendix 1. We retained the remaining variables after analyzing their importance in predicting the target variable. Our group finally rejected 109 variables and retained 141 variables in the input data source to run the different models. Our team identified six reasons for rejecting variables.

1). Rejected by default in SAS EM version 4.3 because of extreme skewness
There were 24 unary, 3 ordinal, and 3 binary variables rejected by default in SAS EM version 4.3. Extremely skewed variables provide almost no variation across records and hence have no effect on predicting the target variable.

2). Rejected based on t-test
SPSS version 14.0 was used to assess the statistical significance of variables with t-tests. 19 variables were rejected because their means did not differ significantly between the two customer types (p-value greater than 0.05).

3). Rejected based on skewness (our team's decision)
Our group decided to reject 8 binary variables with skewness over 98%. After repeated trials we found that these variables were not included in any of the prediction models built, and hence rejected them.

4). Rejected because minutes is a better measure than calls (our team's decision)
We observed redundant variable pairs describing the same behavior. We believed that the number of calls had a positive relation with minutes, so we decided to use just one of each pair for prediction. For example, we used the variable AVG_MINS as an input while rejecting AVG_CALL. We also rejected the standard deviations of the call variables.
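The t-test screening rule in point 2 above (reject a variable when its group means do not differ significantly, i.e. p > 0.05) can be sketched as follows. The data is synthetic: one variable with a real mean difference between the two customer types and one pure-noise variable; the actual screening was done in SPSS.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2103)

# Synthetic stand-ins: an informative variable (mean shift between 2G and 3G)
# and a noise variable (no mean shift).
informative_2g = rng.normal(10.0, 2.0, 500)
informative_3g = rng.normal(11.0, 2.0, 500)
noise_2g = rng.normal(0.0, 1.0, 500)
noise_3g = rng.normal(0.0, 1.0, 500)

# Welch's two-sample t-test for each candidate variable
t_inf, p_inf = stats.ttest_ind(informative_2g, informative_3g, equal_var=False)
t_noise, p_noise = stats.ttest_ind(noise_2g, noise_3g, equal_var=False)

print(f"informative: p={p_inf:.2e} -> {'keep' if p_inf < 0.05 else 'reject'}")
print(f"noise:       p={p_noise:.3f} -> {'keep' if p_noise < 0.05 else 'reject'}")
```

With a genuine mean shift the p-value is effectively zero and the variable is kept; a variable whose means coincide across customer types tends to fail the 0.05 threshold and is rejected.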
In total we rejected 36 variables under this category.

5). Rejected because of redundancy in calculation and measurement
Some variables were rejected based on redundancy. For example, the variable REVPAY_PREV_CD was simply a coded version of AVG_REVPAY_AMT, so we kept only AVG_REVPAY_AMT for prediction. Likewise, we used average variables such as AVG_DIS_1900 and AVG_USAGE_DAYS instead of the totals TOT_DIS_1900 and TOT_USAGE_DAYS. Some variables were also sums of other variables (e.g., A = B + C); we ran different prediction models to assess their importance and concluded that A alone was significant enough to predict the target variable. An example was AVG_BILL_VOICE, which is the sum of AVG_BILL_VOICED and AVG_BILL_VOICEI. The same reasoning applied to the standard deviations of these variables.

6). Rejected because of improper data entry
The variable AVG_CALL_MOB was described as the average number of mobile minutes in the last six months, and AVG_MINS_MOB as the average number of mobile calls in the last six months. We concluded that there were errors in the data entry and hence rejected these variables. STD_CALL_MOB and STD_MINS_MOB were rejected for the same reason.

4. Modeling

Our team used SAS Enterprise Miner version 4.3 to build the prediction models.

Data Set Attributes
We specified CUSTOMER_TYPE as the target variable. In addition to the automatic attribute settings made by SAS, we set the roles of variables based on their importance in prediction, as explained in the data exploration. Thus we included 141 input variables in the prediction models.

Sampling
The stratified sampling method was used, with the variable CUSTOMER_TYPE set to "use" in the status tab. A random seed of 2103 and a sample size of 6,000 corresponding to 50% 2G and 50% 3G customers were used. We could have used any sample size up to 6,000.
But since only 3,000 records were available for 3G, we used all 3,000 of them in the sample dataset for prediction.

Data Partition
In order to build the model, the dataset was split into training (60%), validation (30%), and testing (10%) partitions using a random seed of 2103. We tried different data partitions (70:20:10, 60:30:10, 50:40:10) and different random seeds, and concluded after trying different models that seed 2103 with a 60:30:10 partition was the best among those considered.

Model Building Process
We started the model building process with the default dataset given by PAKDD. The tree describing the model building process our team adopted to arrive at the final model is shown in appendix 2.

1. Default models: 217 input variables, 5 prediction models
2. Models after manually rejecting variables: 141 input variables, 5 prediction models
3. Models after replacements and transformations: 141 input variables, 5 prediction models

Our team used five data mining techniques for prediction:

1. Logistic Regression (LR)
2. Decision Tree (DT)
   a) CHAID algorithm (Chi-square, alpha = 0.05)
   b) C4.5 algorithm (entropy reduction)
   c) CART algorithm (Gini reduction)
3. Variable Selection (Decision Tree) followed by
   a) Logistic Regression
   b) Artificial Neural Network (MLP)
4. Variable Selection (Chi-square/R-square) followed by
   a) Logistic Regression
   b) Decision Tree
   c) Artificial Neural Network (MLP)
5. Ensemble Model (a combination of the above four)

We built all five prediction models as above and examined their misclassification rates, accuracy, sensitivity, and specificity. The values for the five models on the default dataset are shown in appendix 4. Then we used the dataset with the variables retained after rejecting 109 variables based on the team's decision.
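The sampling and partition scheme described above can be sketched with pandas and scikit-learn. This is a stand-in for the SAS EM sample and partition nodes, using only the target column; the 50/50 sample of 6,000 takes all 3,000 3G records plus 3,000 sampled 2G records, then splits 60:30:10.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 15,000 2G and 3,000 3G records, as in sintrain
df = pd.DataFrame({"CUSTOMER_TYPE": ["2G"] * 15000 + ["3G"] * 3000})

# 50/50 sample of 6,000: all 3,000 3G records plus 3,000 sampled 2G records
sample = pd.concat([
    df[df.CUSTOMER_TYPE == "2G"].sample(n=3000, random_state=2103),
    df[df.CUSTOMER_TYPE == "3G"],
])

# 60/30/10 train/validation/test split, stratified on the target:
# first carve off 40%, then split that 40% into 30% + 10% of the total.
train, rest = train_test_split(sample, test_size=0.4, random_state=2103,
                               stratify=sample.CUSTOMER_TYPE)
valid, test = train_test_split(rest, test_size=0.25, random_state=2103,
                               stratify=rest.CUSTOMER_TYPE)
print(len(train), len(valid), len(test))  # 3600 1800 600
```

Note that scikit-learn's `random_state=2103` will not reproduce SAS EM's seed-2103 sample; the seed only makes this sketch repeatable.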
We observed that the prediction accuracy, sensitivity, and specificity improved; the results are shown in appendix 5. Finally we ran the prediction models with appropriate missing value replacement techniques and transformations. The transformations of variables in the best model are shown in appendix 7. We again observed an increase in accuracy, specificity, and sensitivity, as shown in appendix 6. The node diagram for building this model is shown in appendix 3. We then selected 3 models based on the mean accuracy, sensitivity, and specificity.

5. Evaluation

In order to evaluate the best models, we decided to analyze the misclassification rates on the test dataset and the lift charts of the models. The models were evaluated on accuracy measures such as classification accuracy, sensitivity, and specificity. The results were achieved using cross-validation for each model and are based on the validation dataset. Once the confusion matrices were constructed, the accuracy, sensitivity, and specificity of each model were calculated with the respective formulas:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Best Model
Three models were chosen based on the mean accuracy, mean sensitivity, and mean specificity:

C4.5: Decision Tree (C4.5 entropy algorithm)
MLP by s: Variable selection by Chi-square followed by the Artificial Neural Network algorithm
MLP by C: Variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm

For these three model types, the detailed prediction results on the validation dataset are presented below.
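The three metric formulas can be expressed as a small helper function. The confusion-matrix counts used here are illustrative only (the report does not list the actual matrices); they describe a hypothetical 1,800-record validation split with 900 positives.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate: TP / (TP + FN)
    specificity = tn / (tn + fp)   # true-negative rate: TN / (TN + FP)
    return accuracy, sensitivity, specificity

# Illustrative counts (not the report's actual confusion matrix):
acc, sens, spec = confusion_metrics(tp=668, tn=782, fp=118, fn=232)
print(round(acc, 4), round(sens, 4), round(spec, 4))  # 0.8056 0.7422 0.8689
```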
Table 2: Comparison of prediction results on the validation dataset

   Measure       DT (C4.5)   DT (C4.5) variable      Chi-square variable
                             selection + ANN (MLP)   selection + ANN (MLP)
   Accuracy      0.8117      0.8056                  0.8050
   Sensitivity   0.7898      0.7419                  0.8231
   Specificity   0.8335      0.8690                  0.7869

We found that the ANN model with variable selection by the C4.5 algorithm achieved a classification accuracy of 0.8056, with a sensitivity of 0.7419 and a specificity of 0.8690. The ANN model with variable selection by Chi-square achieved a classification accuracy of 0.8050, with a sensitivity of 0.8231 and a specificity of 0.7869. The decision tree (C4.5) performed best of the three on raw accuracy, achieving a classification accuracy of 0.8117 with a sensitivity of 0.7898 and a specificity of 0.8335. However, the differences in accuracy measures among these models are insignificant. Thus, we compared the three models using the lift charts shown in appendix 8. We observed that the first 30% of cumulative customers from the MLP by C model captured almost 57% of the 3G customers, slightly higher than the other two models. Also, the profit was highest for the MLP by C model over the first 20% of the cumulative customers. From the lift value chart of the MLP by C model, beyond about 47% of the ranked customers the rate of 3G customers found is equal to or lower than what a random sample would yield. Thus we concluded that the DT (C4.5) variable selection followed by ANN (MLP) model was the best model for predicting the 3G customers.

Rules in Decision Tree
In order to select variables, we built the decision tree using the C4.5 algorithm, which splits the input observations into two or more subgroups. This step is repeated at each leaf node until the complete tree is constructed. The root node was split according to HS_MODEL, resulting in the first level.
A total of 845 samples with handset model values such as 11240, 10829, 10680, etc. were grouped into node 1. In contrast, 2755 samples with the other HS_MODEL values were grouped into node 2. Node 2 was then split according to the handset age in months (HS_AGE), yielding the second level of the tree: 677 samples with HS_AGE smaller than 3.5 were grouped into node 3, and 2978 samples with age values greater than or equal to 3.5 were grouped into node 4. Node 3 was split on the DAYS_TO_CONTRACT_EXPIRY values: 471 samples with fewer than 6.5 days to the handset contract expiry date were grouped into node 5, whereas 206 samples with 6.5 days or more were grouped into node 6. According to the current subscription plan type (SUBPLAN), node 4 was split into two parts: 518 samples in node 7 with values 2102, 2136, 2248, etc., and 1560 samples in node 8 with the other SUBPLAN values. Node 6 was split on HS_MODEL: 104 samples were grouped into node 9, while 102 samples with HS_MODEL values such as 10828, 10809, 10660, 11234, etc. were grouped into node 10. Then node 7 was split by the HS_MODEL criterion into node 11 and node 12. According to the average games utilization (Kb) in the last 6 months (AVG_VAS_GAMES), nodes 13 and 14 were split from node 10; the splitting criterion was 932203. Based on the criterion 150198.25 for the average GPRS data utilization (Kb) in the last 6 months (AVG_VAS_GPRS), nodes 15 and 16 were split from node 11. Finally, nodes 17 and 19 were created from node 15 by the previous subscription plan type. Accordingly, seven variables were selected: HS_MODEL, HS_AGE, SUBPLAN, DAYS_TO_CONTRACT_EXPIRY, SUBPLAN_PREVIOUS, AVG_VAS_GPRS, and AVG_VAS_GAMES.

Important Variables in the Final Model
After selecting variables with the C4.5 algorithm, we created an ANN model with a multilayer perceptron (MLP). We ran the ANN with one hidden layer and one hidden neuron.
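The shape of this two-stage pipeline (entropy-based tree for variable selection, then an MLP with a single hidden neuron on the selected inputs) can be sketched in scikit-learn. This is an approximation on synthetic data, not the SAS EM nodes: sklearn's entropy criterion is CART-style rather than true C4.5, and the feature-importance threshold is our own choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data standing in for the sampled sintrain set
X, y = make_classification(n_samples=2000, n_features=30, n_informative=6,
                           random_state=2103)

# Stage 1: variable selection with an entropy (C4.5-style) decision tree;
# keep every feature the tree actually used for a split.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                              random_state=2103).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)

# Stage 2: MLP with one hidden layer of one neuron on the selected variables
mlp = MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000,
                    random_state=2103).fit(X[:, selected], y)
print(len(selected), round(mlp.score(X[:, selected], y), 2))
```

A one-neuron hidden layer keeps the network close to a logistic regression on the tree-selected variables, which matches the report's preference for a simple final model.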
The ANN model with variable selection by the C4.5 decision tree algorithm provided information about the relative importance of the input variables in predicting customer type. The analysis ranked the input variables by their relative importance. As shown in appendix 9, the variable HS_MODEL_10214 was by far the most important variable in predicting the customer type. The second most important variable was AVG_MIS0001, transformed from AVG_VAS_GPRS. The third most important variable was HS_A_70W0001:low, transformed from HS_AGE. In all, 11 variables were critical to the prediction of customer type: HS_MODEL 10328, HS_MODEL 10326, HS_MODEL 10324, HS_MODEL 10316, HS_MODEL 10311, HS_MODEL 10214, HS_MODEL 10211, HS_A_70W0001:low, AVG_MIS0001:low, AVG_60QC0001:low (transformed from AVG_VAS_GAMES), and DAYS_TO_CONTRACT_EXPIRY.

Scoring
The best model (DT (C4.5) variable selection followed by ANN (MLP)) chosen above was connected to the scoring dataset provided by PAKDD. Using the scoring node in SAS EM version 4.3, we determined the percentages of 2G and 3G customers to be 77.32% and 22.68% respectively; thus 22.68% of the scoring dataset would be of the 3G customer type. A tab-delimited file named STL5.txt containing the scored dataset is attached to this report.

6. Deployment

For the business implication, if the company would like to spend money on a marketing campaign, it may consider this model for the purpose. In the training dataset there were 83.33% 2G customers and 16.67% 3G customers. Our team's best model classified the customers in the scored dataset as 77.32% 2G and 22.68% 3G. The manager should update this model after every campaign and make the necessary changes for future predictions.
If the manager believes that more variables should be included or deleted based on his or her domain expertise, then he or she may do so in the model. The best model was cross-checked on the sintrain dataset to verify its sensitivity and accuracy on the original dataset of 18,000 records; the sensitivity and accuracy of our model on the sintrain dataset were 79.5% and 84.05% respectively.

Financial Terms
From the 18,000-customer dataset we know the exact number of 3G customers is 3,000. Assuming the cost of advertising is $1 per mailing and we make $10 on each response:

Mailing to all customers:
   Total cost: $1 x 18,000 = $18,000
   Revenue: $10 x 3,000 = $30,000
   Profit: $12,000

Marketing campaign based on our best model:
   Total cost: $1 x 4,640 = $4,640
   Revenue: $10 x 2,385 = $23,850
   Profit: $19,210

Hence the company makes an additional profit of $7,210; expenses decrease when the model predicts the correct customer market segments. Thus, in order to deploy data mining results in the business, the company has to monitor and maintain its data properly. Also, if there is a change in technology or network, the company should determine a new suitable prediction model. This will help the company recognize new trends in time and apply the right strategy to compete in the market.

Conclusion

A prediction model was built on the sintrain dataset provided by PAKDD to predict the 3G customer types. A decision tree with the C4.5 algorithm as a variable selection node followed by an Artificial Neural Network model (multilayer perceptron) was the best model based on accuracy, sensitivity, specificity, and lift charts. The handset model numbers, average GPRS data utilization (Kb) in the last 6 months, the age of the handset in months, average games utilization in the last 6 months, and the days to contract expiry were the major variables for predicting the 3G customers.
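The campaign-profit arithmetic from the Financial Terms passage of the Deployment section can be double-checked with a few lines (counts and the $1 / $10 cost and revenue assumptions are taken directly from that passage):

```python
cost_per_mail, revenue_per_response = 1, 10

# Mass mailing: all 18,000 customers, 3,000 of whom are 3G responders
profit_all = revenue_per_response * 3000 - cost_per_mail * 18000
print(profit_all)  # 12000

# Model-targeted mailing: 4,640 predicted 3G customers, 2,385 true positives
profit_model = revenue_per_response * 2385 - cost_per_mail * 4640
print(profit_model)  # 19210

# Additional profit from targeting with the model
print(profit_model - profit_all)  # 7210
```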
The manager may consider focusing on these particular customer market segments to improve the buying potential of the customers for 3G network services. The 3G customers predicted in the holdout (sinscore) dataset were 22.68% of customers. The tab-delimited file of the holdout sample, named STL5.txt, is attached to this report.

APPENDIX

Appendix 1: Rejected variables

1. Rejected by SAS default (unary and ordinal), 30 variables:
HS_CHANGE, TELE_CHANGE_FLAG, TOT_PAST_DEMAND, VAS_VMN_FLAG, VAS_VMP_FLAG, VAS_INFOSRV, VAS_SN_FLAG, VAS_CSMS_FLAG, DELINQ_FREQ, AVG_VAS_QI, AVG_VAS_QP, AVG_VAS_QTXT, AVG_VAS_IDU, AVG_VAS_WLINK, AVG_VAS_MILL, AVG_VAS_IFSMS, AVG_VAS_123_, AVG_VAS_CG, AVG_VAS_IEM, AVG_VAS_ISMS, AVG_VAS_SS, STD_VAS_QI, STD_VAS_QP, STD_VAS_QTXT, STD_VAS_IDU, STD_VAS_WLINK, STD_VAS_MILL, STD_VAS_CG, STD_VAS_IFSMS, STD_VAS_123_, STD_VAS_IEM, STD_VAS_ISMS, STD_VAS_SS

2. Rejected based on t-test, 19 variables:
TOT_PAST_TOS, TOT_TOT_DAYS, OD_REL_SIZE, REVPAY_FREQ, AVG_VAS_QG, AVG_VAS_QTUNE, AVG_VAS_WAP, AVG_VAS_ESMS, AVG_VAS_GPSMS, AVG_VAS_YMSMS, AVG_EXTRAN_RATIO, AVG_SPHERE, STD_VAS_QG, STD_VAS_CWAP, STD_VAS_ESMS, STD_VAS_GPSMS, STD_VAS_YMSMS, STD_CALL_FRW_RATIO, STD_BUCKET_UTIL

3. Rejected based on skewness of binary variables (> 98%), our team's decision, 8 variables:
ID_CHANGE_FLAG, VAS_CND_FLAG, VAS_CNND_FLAG, VAS_DRIVE_FLAG, VAS_FF_FLAG, VAS_IB_FLAG, VAS_NR_FLAG, VAS_IEM_FLAG

4. Rejected call variables (and their standard deviations) assuming minutes is a better measure than calls, our team's decision, 36 variables:
AVG_CALL, AVG_MINS_OB, AVG_CALL_OB, AVG_MINS_IB, AVG_CALL_IB, AVG_MINS_PK, AVG_CALL_PK, AVG_MINS_OP, AVG_CALL_OP, AVG_CALL_FIX, AVG_CALL_INTRAN, AVG_CALL_EXTRAN, AVG_CALL_LOCAL, AVG_CALL_T1, AVG_CALL_EXTRANTS, AVG_CALL_INT, AVG_CALL_FRW, AVG_CALL_1900, STD_CALL, STD_MINS_OB, STD_CALL_OB, STD_MINS_IB, STD_CALL_IB, STD_MINS_PK, STD_CALL_PK, STD_MINS_OP, STD_CALL_OP, STD_CALL_FIX, STD_CALL_INTRAN, STD_CALL_EXTRAN, STD_CALL_LOCAL, STD_CALL_T1, STD_CALL_EXTRANTS, STD_CALL_INT, STD_CALL_FRW, STD_CALL_1900

5. Rejected because of redundancy in calculation and measurement, 9 variables:
REVPAY_PREV_CD, TOT_DIS_1900, TOT_USAGE_DAYS, AVG_BILL_VOICED, AVG_BILL_VOICEI, AVG_VAS_ARC, STD_BILL_VOICED, STD_BILL_VOICEI, STD_VAS_ARC

6. Rejected because of improper data entry, 4 variables:
AVG_CALL_MOB, AVG_MINS_MOB, STD_CALL_MOB, STD_MINS_MOB

Appendix 2: Model building tree
[Diagram not reproduced. It shows three modeling stages (default models; models after rejecting variables; models after rejecting, replacing, and transforming variables), each running LR, DT (CHAID, C4.5, CART), variable selection by DT, Chi-square, and R-square followed by LR, MLP, or DT, and an Ensemble model (EM).]

Appendix 3: Final node diagram for the models built with the appropriate replacements and transformations
[Diagram not reproduced.]

Appendix 4: Default models

   Model                  Misclassification rate    Accuracy         Sensitivity      Specificity
                          Train   Valid.  Test      Valid.  Test     Valid.  Test     Valid.  Test
   LR                     0.2483  0.26    0.2616    0.7400  0.7383   0.6840  0.6913   0.7957  0.7889
   DT (CHAID)             0.1927  0.2211  0.2533    0.7788  0.7466   0.7619  0.7491   0.7957  0.7439
   DT (C4.5)              0.2013  0.2356  0.2616    0.7644  0.7383   0.7708  0.7717   0.7580  0.7024
   DT (CART)              0.1991  0.235   0.2516    0.765   0.7483   0.7717  0.7519   0.7780  0.7231
   VS (DT) + LR           0.2430  0.2488  0.2416    0.7511  0.7583   0.6740  0.7170   0.8279  0.8027
   VS (DT) + ANN (MLP)    0.2425  0.2477  0.255     0.7522  0.745    0.6440  0.6688   0.8601  0.8269
   VS (Chi2) + LR         0.2525  0.2611  0.2466    0.7388  0.7533   0.7463  0.7877   0.7314  0.7162
   VS (Chi2) + ANN (MLP)  0.2383  0.2477  0.2483    0.7522  0.7516   0.7330  0.7524   0.7713  0.7508
   VS (Chi2) + DT (CHAID) 0.1977  0.2122  0.2466    0.7877  0.7533   0.7775  0.7524   0.7980  0.7543
   VS (R2) + LR           0.2488  0.2638  0.24      0.7361  0.76     0.7419  0.7877   0.7302  0.7301
   VS (R2) + ANN (MLP)    0.22    0.2522  0.2583    0.7477  0.7416   0.7163  0.7299   0.7791  0.7543
   VS (R2) + DT (CHAID)   0.2036  0.2172  0.2583    0.7827  0.7416   0.7897  0.7620   0.7758  0.7197
   EM                     0.1841  0.2027  0.205     0.7972  0.795    0.7764  0.7877   0.8179  0.8027

(VS = variable selection; EM = ensemble model.)

Appendix 5: Models after rejecting variables based on the rules set up by our team
   Model                  Misclassification rate    Accuracy         Sensitivity      Specificity
                          Train   Valid.  Test      Valid.  Test     Valid.  Test     Valid.  Test
   LR                     0.17    0.2333  0.255     0.7667  0.745    0.6819  0.6655   0.8513  0.8304
   DT (CHAID)             0.1552  0.2     0.241     0.7900  0.7583   0.7608  0.7620   0.8191  0.7543
   DT (C4.5)              0.1502  0.1861  0.21      0.8139  0.79     0.7853  0.7620   0.8424  0.8200
   DT (CART)              0.1705  0.1988  0.2333    0.8011  0.7666   0.8209  0.8102   0.7736  0.7197
   VS (DT C4.5) + LR      0.1805  0.23    0.2483    0.7689  0.7516   0.7642  0.7427   0.7736  0.7612
   VS (DT C4.5) + MLP     0.1688  0.2222  0.245     0.7778  0.755    0.7275  0.7170   0.8280  0.7958
   VS (Chi2) + LR         0.1766  0.23    0.2083    0.7694  0.7616   0.7575  0.7684   0.7814  0.7543
   VS (Chi2) + MLP        0.1666  0.2311  0.235     0.7689  0.765    0.7642  0.7781   0.7736  0.7508
   VS (Chi2) + DT (C4.5)  0.1569  0.1872  0.22      0.8128  0.78     0.7864  0.7588   0.8391  0.8027
   VS (R2) + LR           0.1808  0.2316  0.2283    0.7683  0.7716   0.7119  0.7202   0.8246  0.8269
   VS (R2) + MLP          0.1916  0.2311  0.255     0.7689  0.745    0.6808  0.6559   0.8568  0.7408
   VS (R2) + DT (C4.5)    0.1719  0.2011  0.2083    0.7989  0.7916   0.7542  0.7427   0.8435  0.8442
   EM                     0.1319  0.1738  0.1966    0.8261  0.8033   0.7686  0.7588   0.8834  0.8512

(VS = variable selection; EM = ensemble model.)

Appendix 6: Models after rejecting variables based on the rules, performing missing value replacements, and appropriate transformations

   Model                  Misclassification rate    Accuracy         Sensitivity      Specificity
                          Train   Valid.  Test      Valid.  Test     Valid.  Test     Valid.  Test
   LR                     0.145   0.1944  0.2083    0.8039  0.7917   0.7430  0.7395   0.8646  0.8478
   DT (CHAID)             0.1616  0.1994  0.22      0.8056  0.78     0.7731  0.7363   0.8280  0.827
   DT (C4.5)              0.1580  0.1883  0.2216    0.8117  0.7783   0.7898  0.7588   0.8335  0.7993
   DT (CART)              0.1769  0.2066  0.24      0.7933  0.76     0.8398  0.8135   0.7469  0.7024
   VS (DT C4.5) + LR      0.1580  0.2083  0.205     0.7917  0.795    0.7508  0.7588   0.8324  0.8339
   VS (DT C4.5) + MLP     0.1516  0.1944  0.2166    0.8056  0.7833   0.7419  0.7203   0.8690  0.8512
   VS (Chi2) + LR         0.1577  0.210   0.2083    0.7889  0.7734   0.7453  0.7428   0.8324  0.8062
   VS (Chi2) + MLP        0.2125  0.2372  0.2483    0.805   0.7783   0.8231  0.8103   0.7869  0.7439
   VS (Chi2) + DT (C4.5)  0.1855  0.215   0.235     0.7789  0.755    0.7864  0.7814   0.7714  0.7266
   VS (R2) + LR           0.1736  0.2111  0.2266    0.7900  0.7917   0.7375  0.7524   0.8424  0.834
   VS (R2) + MLP          0.1691  0.195   0.2216    0.7628  0.7517   0.7308  0.7428   0.7947  0.7612
   VS (R2) + DT (C4.5)    0.1855  0.2211  0.245     0.785   0.765    0.8365  0.8264   0.7336  0.699
   EM                     0.2022  0.2361  0.2666    0.7639  0.7333   0.6463  0.6238   0.8812  0.8512

(VS = variable selection; EM = ensemble model.)

PAKDD 2006 OSU-Group 5

Appendix 7: Transformation of variables in the final best model (variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm)

   Variable name             Transformation used
   HS_AGE                    Binning (buckets = 2)
   AVG_VAS_GAMES             Binning (buckets = 2)
   AVG_VAS_GPRS              No effective transformation could be done
   DAYS_TO_CONTRACT_EXPIRY   No effective transformation could be done

(The before/after distribution plots are not reproduced.)

Appendix 8: Lift charts
C4.5: Decision Tree (C4.5 entropy algorithm)
MLP by s: Variable selection by Chi-square followed by the Artificial Neural Network algorithm
MLP by C: Variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm
[Charts not reproduced: % captured response on validation data (cumulative); lift value on validation data (non-cumulative); profit on validation data (cumulative).]

Appendix 9: Results from the best model, DT (C4.5) variable selection followed by ANN (MLP)
[Variable importance output not reproduced.]
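Since the Appendix 8 chart images are not reproduced here, the cumulative %-captured-response curve they show can be recomputed from any model's scored probabilities. The sketch below uses synthetic scores standing in for model output on a 6,000-record 50/50 validation-style sample.

```python
import numpy as np

rng = np.random.default_rng(2103)

# Synthetic stand-ins: true 3G flags and model scores that are
# deliberately higher, on average, for true 3G customers.
y = rng.random(6000) < 0.5
scores = np.where(y, rng.random(6000) * 0.7 + 0.3, rng.random(6000) * 0.7)

order = np.argsort(-scores)                # rank customers best-first
captured = np.cumsum(y[order]) / y.sum()   # cumulative share of 3G captured
depth = np.arange(1, len(y) + 1) / len(y)  # cumulative share of customers

# Share of 3G customers captured in the first 30% of ranked customers
# (the report cites about 57% at this depth for the MLP by C model)
k = int(0.3 * len(y))
print(round(float(captured[k - 1]), 2))
```

Plotting `captured` against `depth` gives the cumulative captured-response chart; dividing `captured` by `depth` gives the cumulative lift curve.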