PAKDD 2006 Data Mining Competition
Singapore

Submitted by: TUL3
BHARATH KONDURU
SAURBH PAWAR
SAYLEE SAKALIKAR
VIJAY GHUGE

DATE: Wednesday, March 1, 2006

Executive Summary

Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It is the science of extracting useful information from large data sets or databases, and it is usually associated with a business's or other organization's need to identify trends.

An Asian telco operator has successfully launched a third generation (3G) mobile telecommunications network. Using its existing customer usage and demographic data, it would like to identify which customers are likely to switch to its 3G network. The objective of this analysis is to accurately predict as many current 3G customers as possible. SAS Enterprise Miner 4.3 was used to build the model and perform the analysis.

For the task, the group was presented with a Training data set and a Scoring data set. The training data listed customers whose customer type (2G or 3G) was known. The first step was an inspection of the data set. From this analysis we identified the variables likely to influence a customer's transition from 2G to 3G technology, based on correlation, intuition, technical facts and managerial judgment. On the basis of these criteria the group selected 62 of the 252 variables in the data set to reduce redundancy. To examine missing values and distributions, an Insight node was added; it showed that some variables needed replacement and transformation. Closer scrutiny of the Insight node output also showed that the data set was biased towards 2G customers, so the team sampled the data to build an unbiased model. The data was then partitioned into Training, Validation and Testing data sets.
The split chosen for the partition was 65%, 25% and 10% respectively of the training sample drawn from the data on 24000 network customers. A log transformation was applied to variables with skewed distributions. Variables whose values were unevenly spread across intervals were binned. Unknown values were replaced by the default constant 'None', and tree imputation was used to replace the missing values of all other variables.

After these modifications to improve the predictive power of the model, the team built models with different algorithms. The models built for prediction were:

1. Logistic Regression
2. Decision Tree
3. Artificial Neural Network
4. Ensemble model

Of these models, the Decision Tree gave the best prediction of 3G customers. The best model was selected on the basis of the sensitivity and misclassification rates obtained on all three data sets. These values for the Decision Tree model are shown below.

Model: Decision Tree

                   Training   Validation   Testing
% Sensitivity       79.52       79.09       77.85
% Misc. Rate        21.57       21.47       24.33

Table 1: Sensitivity and Misclassification Rate of Gini 3 Way (Best Model)

The Decision Tree model selected the following variables as the most important factors for predicting current 3G customers: HS_AGE, AVG_NO_CALLED, AVG_VAS_GAMES, NUMBER OF DAYS SINCE THE LAST RECEIVED RETENTION CAMP, TOP 2 INTERNATIONAL COUNTRY, AVG_BILL_VOICED, AVG_VAS_SMS, AVG_MINS_MOB, AVG_MINS_IB, STD_NO_CALLED and LINE_TENURE.

The Decision Tree model was then used to score the Scoring data set to obtain the predicted serial numbers (IDs) of 3G customers.

TABLE OF CONTENTS

1.0 Introduction
    1.1 Data Insight
    1.2 Sampling
    1.3 Data Partition
    1.4 Missing Value Replacement
    1.5 Variable Transformation
2.0 Model Development
    2.1 Logistic Regression
    2.2 Decision Tree
        2.21 Splitting Criterion
    2.3 Artificial Neural Network
    2.4 Ensemble Model
3.0 Scoring
4.0 Conclusion
5.0 References
6.0 Appendix

1.0 Introduction

The team analyzed the data set provided by the Asian telco operator using SAS Enterprise Miner 4.3. Different prediction models were built to identify 3G customers accurately. A detailed report of the team's findings, and of the models and algorithms used for this prediction, is provided below.

1.1 Data Insight

The data set provided for the analysis contained 24000 customer records with nearly 250 variables. It included information on the top 3 countries called, customer class, the types of calls made, and services used such as games, tunes, SMS and GPRS. Many variables carrying redundant information were rejected manually. The variables set with the model role 'input' are shown in Appendix 2. The observed data set was biased, had missing values, and some variables were not normally distributed, so the variables were modified as described in the following sections.

1.2 Sampling

Because the data set was fairly large and biased towards 2G customers, it was sampled so that the model would learn to make unbiased predictions. The team used the Sampling node to balance the 2G and 3G customers. Of the 18000 customer records in the training data set, 3000 were 3G customers. To give 2G and 3G customers equal representation, the sample size was set to 6000, using the stratified sampling method.
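The balanced sample described in Section 1.2 can be illustrated with a short sketch. This is not the SAS Sampling node; it is a minimal Python illustration that assumes equal-size stratified sampling (3000 records per class) and uses synthetic records in place of the competition data. The function name `balanced_sample` and the seed are hypothetical.

```python
import random

def balanced_sample(records, label_key, per_class, seed=0):
    """Draw the same number of records from each class (equal-size strata)."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    sample = []
    for label in sorted(by_class):
        # Sample without replacement inside each stratum.
        sample.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(sample)
    return sample

# Synthetic stand-in for the 18000 training records (3000 of them 3G).
records = [
    {"SERIAL_NUMBER": i, "CUSTOMER_TYPE": "3G" if i < 3000 else "2G"}
    for i in range(18000)
]

sample = balanced_sample(records, "CUSTOMER_TYPE", per_class=3000)

counts = {}
for rec in sample:
    counts[rec["CUSTOMER_TYPE"]] = counts.get(rec["CUSTOMER_TYPE"], 0) + 1
print(len(sample), counts)
```

With 3000 records per class the sample has 6000 records, matching the sample size reported above.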
1.3 Data Partition

The sampled data set was partitioned into training, validation and testing data sets using a 65/25/10 split. The method was simple random with a seed of 8108. This resulted in 3900 records for training, 1500 for validation and 600 for testing.

1.4 Missing Value Replacement

The data set contained many missing values, so replacement was needed. The variables TOP 1 INTERNATIONAL COUNTRY, TOP 2 INTERNATIONAL COUNTRY, TOP3_INT_CD, MARITAL_STATUS and OCCUP_CD had unknown values, which were replaced with the default constant 'None'. For all other variables, class and interval alike, tree imputation was used to replace missing values. Using the default tab, imputed indicator variables were created for missing values.

1.5 Variable Transformation

Some variables whose distributions were positively or negatively skewed, or spiked at one end, were identified for transformation to bring them closer to normality. The transformations applied were log transformation and binning. An example, the log transformation of AVG_NO_CALLED, is shown below.

Figure 1: Log Transformation of AVG_NO_CALLED

An example of binning, for the variable STD_MINS, is shown below.

Figure 2: Binning of STD_MINS

2.0 Model Development

The data set was analyzed using statistical models provided by SAS Enterprise Miner. The models built by the team were (see Appendix 1):

Logistic Regression
Decision Tree
Artificial Neural Network
Ensemble Model

2.1 Logistic Regression

The binary nature of the target variable (a customer is either 2G or 3G) is the main reason for considering logistic regression. Logistic regression describes the relationship between a categorical response variable, which can be binary, ordinal or nominal, and a set of predictor variables.
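As a rough illustration of how a binary logistic model links predictors to a 2G/3G outcome, the sketch below passes a linear score through the logistic function. The coefficients, the intercept and the 0.5 cutoff are made-up values for illustration only; they are not the estimates produced by the team's SAS model.

```python
import math

def logistic(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_3g_probability(features, weights, intercept):
    """Probability of the 3G class for one customer under a logistic model."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return logistic(z)

# Hypothetical coefficients and customer values (not taken from the report).
weights = {"HS_AGE": -0.8, "AVG_VAS_SMS": 0.3}
customer = {"HS_AGE": 1.2, "AVG_VAS_SMS": 2.0}

p = predict_3g_probability(customer, weights, intercept=0.1)
label = "3G" if p >= 0.5 else "2G"
print(round(p, 3), label)
```

A negative coefficient on handset age in this toy example means that older handsets lower the predicted probability of 3G; the direction in the real model would have to be read from the SAS output.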
The logistic regression fits an 'S'-shaped curve, which better captures the binary nature of the target. The graph below shows the most important variables on the basis of effect T-scores. [1][2][3] These variables include HS_AGE, DAYS_TO_CONTRACT_EXPIRY, STD_VAS_GAMES, NUMBER OF DAYS SINCE THE LAST RECEIVED RETENTION CAMP and AVG_VAS_SMS.

Figure 3: Most Important Variables influencing the predictivity

The % sensitivity and % misclassification rate of the model are shown in the comparison table below.

2.2 Decision Tree

Decision trees partition large amounts of data into smaller segments by applying a series of rules. Creating and evaluating decision trees benefits greatly from visualization of the trees and from diagnostic measures of their effectiveness. Decision trees can discover unexpected relationships, identify subtle relationships, handle both categorical and continuous data, and accommodate missing data. The tree can be viewed at various levels of detail, and diagnostic plots and statistics can be examined. To assess the model appropriately, the following specific values were set in the Tree node. [4][5]

2.21 Splitting Criteria

All three purity measures (Chi-Square, Entropy Reduction and Gini Reduction) were tried, and the last was selected based on the following results:

                      Training            Validation           Testing
Model             % Sens.  % Misc.    % Sens.  % Misc.    % Sens.  % Misc.
Chi-Sq 2 Way       80.91    22.38      80.05    23.4       77.33    24.83
Chi-Sq 3 Way       78.37    21.28      77.2     21         76       25.5
Entropy 2 Way      81.17    21.59      77.97    23.53      76.67    26
Entropy 3 Way      79.2     20.49      77.72    21.67      76.33    25.16
Gini 2 Way         78.99    20.69      78.23    21.47      76.33    26.16
Gini 3 Way         79.52    21.57      79.09    21.47      77.85    24.33

If the minimum number of observations in a leaf is too low, the tree overfits the training data set; if it is too high, the tree underfits the training data set and misses relevant patterns in the data. The team therefore chose 20 as the value for this parameter.
The 'observations required for a split search' parameter controls the depth of the tree and must be no less than twice the value of the 'minimum number of observations in a leaf' parameter; it was set to 100. The maximum depth of the tree was set to 6 for detailed analysis. The trade-off was made considering consistency in the sensitivity and misclassification rate. Although Chi-Square 2 Way and Entropy 2 Way have higher training sensitivity, they do not show consistency in misclassification rate and in the final output.

Important Variables:

1. HS_AGE
2. AVG_MINS_MOB
3. LST_RETENTION_CAMP
4. TOP1_INT_CD
5. TOP2_INT_CD
6. AVG_VAS_GAMES
7. AVG_NO_CALLED

With the above options and important variables, the rules of this decision tree that yield a majority of 3G responses are given below.

1. If Hand Set Age in months < 0.8958
2. If Hand Set Age in months < 2.1383 and Number Of Days Since The Last Received Retention Camp < 0.135
3. If Hand Set Age in months < 2.1383 and Number Of Days Since The Last Received Retention Camp < 0.095 and Average number of mobile calls in last 6 months >= 5.5490
4. If Hand Set Age in months < 2.1383 and Number Of Days Since The Last Received Retention Camp < 0.095 and Average number of mobile calls in last 6 months < 5.5490 and Average games utilization (KB) in last 6 months = 2.48E6-High
5. If Hand Set Age in months < 2.1383 and Number Of Days Since The Last Received Retention Camp < 0.095 and Average number of mobile calls in last 6 months < 5.5490 and Average games utilization (KB) in last 6 months = Low-472078 and Top 1 International Country = 29
6. If Hand Set Age in months >= 2.1383 and Average number of mobile calls in last 6 months > 4.6793 and Top 2 International Country = 258
7. If Hand Set Age in months >= 2.1383 and Average number of mobile calls in last 6 months > 4.6793 and Top 2 International Country = None and Top 1 International Country = 25
8. If Hand Set Age in months >= 2.1383 and Average number of mobile calls in last 6 months > 4.6793 and Top 2 International Country = None and Top 1 International Country = None and Average number of different numbers called in last 6 months >= 3.686

The number of leaves chosen by SAS when pruning the tree was 18.

Figure 4: Number of Leaves Pruned

The % sensitivity and % misclassification rate of the model are shown in the comparison table below (Refer Table 2).

2.3 Artificial Neural Network

An artificial neural network is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") that usually carry numeric (as opposed to symbolic) data encoded by various means. The units operate only on their local data and on the inputs they receive via the connections. The restriction to local operations is often relaxed during training. [5][6][7][8]

Artificial neural networks are a class of flexible nonlinear regression models, discriminant models and data reduction models that are interconnected in a nonlinear dynamic system. Neural networks are useful tools for interrogating increasing volumes of data and for learning from examples to find patterns in data. With 24000 customer records and 252 variables, the given data set is likely to contain complex nonlinear relationships, so a neural network was built to make accurate predictions for this data mining problem.

The team built artificial neural network models with the following architectures:

Multilayer Perceptron (MLP)
Ordinary Radial Basis Function (RBF) with equal widths
Ordinary Radial Basis Function (RBF) with unequal widths

All three architectures were tried with different options in the Neural Network node of SAS Enterprise Miner.
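To make the difference between the equal-width and unequal-width RBF architectures concrete, here is a minimal sketch of a forward pass through a small Gaussian RBF network with three hidden units. It is a generic textbook formulation, not the exact parameterization used by the SAS Neural Network node, and every center, width and weight below is a made-up value.

```python
import math

def rbf_hidden(x, center, width):
    """Gaussian radial basis activation: 1.0 at the center, decaying with distance."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbf_network(x, centers, widths, out_weights, out_bias):
    """Forward pass: RBF hidden layer followed by a logistic output unit."""
    hidden = [rbf_hidden(x, c, w) for c, w in zip(centers, widths)]
    z = out_bias + sum(a * h for a, h in zip(out_weights, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical network with 3 hidden units and 2 inputs.
centers = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
out_weights = [1.5, -2.0, 0.7]
equal_widths = [1.0, 1.0, 1.0]     # one shared width for all hidden units
unequal_widths = [0.5, 1.0, 2.0]   # a separate width per hidden unit

x = [1.0, 1.0]
p_eq = rbf_network(x, centers, equal_widths, out_weights, out_bias=0.0)
p_uneq = rbf_network(x, centers, unequal_widths, out_weights, out_bias=0.0)
print(round(p_eq, 4), round(p_uneq, 4))
```

The only difference between the two calls is whether the hidden units share a single width, which is the distinction the architecture names refer to.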
Based on the % sensitivity and % misclassification rate of the three models, the team picked the neural network with the 'Ordinary RBF with equal widths' architecture.

                      Training            Validation           Testing
Model             % Sens.  % Misc.    % Sens.  % Misc.    % Sens.  % Misc.
ANN MLP            71.27    23.15      68.91    24.33      67.33    26.86
ANN RBF Eq         71.02    23         66.23    24.94      67.79    26
ANN RBF UnEq       70.23    24.41      69.17    24.6       68       26.66

Table 2: Sensitivity and Misclassification Rates of Various ANN models

For this model, the model selection criterion was 'Misclassification Rate'. The number of hidden neurons was 3. Randomization was used for scale estimates, target weights and target bias weights. The model produced the following plot of the important variables and the weights associated with each hidden neuron.

Figure 5: Important variables and the associated weights corresponding to each hidden neuron

The % sensitivity and % misclassification rate of the model are shown in the comparison table below (Refer Table 2).

2.4 Ensemble Model

The Ensemble node creates a new model by averaging the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple models. The component models are integrated by the Ensemble node to form a potentially stronger solution, and the new model is then used to score new data. The component models integrated to form the ensemble model were Logistic Regression, the Decision Tree with Gini 3 Way as the purity measure, and the Artificial Neural Network with the RBF equal-widths architecture. [1]

The % sensitivity and % misclassification rate of the model are shown in the comparison table below.

Comparison of all models built:
                      Training            Validation           Testing
Model             % Sens.  % Misc.    % Sens.  % Misc.    % Sens.  % Misc.
LR                 77.94    19.92      72.6     23.53      72.48    25
Decision Tree      79.52    21.57      79.09    21.47      77.85    24.33
ANN                71.02    23         66.23    24.94      67.79    26
Ensemble Model     79.25    49         78.93    51         73.2     51

Table 2: Comparison showing Sensitivity and Misclassification Rates of all models

From the above comparison table, it can be seen that the % sensitivity and % misclassification rate of the Decision Tree model are better overall than those of the other models, and that the Decision Tree is also more consistent across Training, Validation and Testing. In addition, the false negative rate (3G customers predicted as 2G) of the Decision Tree model is 20.83%, as can be seen from the output below. This means that 625 customers who are 3G were wrongly predicted as 2G, which is fewer than for all the other models.

The FREQ Procedure
Table of CUSTOMER_TYPE by I_CUSTOMER_TYPE

CUSTOMER_TYPE(CUSTOMER_TYPE)    I_CUSTOMER_TYPE(Into: CUSTOMER_TYPE)

Frequency |
Row Pct   | 2G      | 3G      |  Total
----------+---------+---------+
2G        |  12546  |   2454  |  15000
          |  83.64  |  16.36  |
----------+---------+---------+
3G        |    625  |   2375  |   3000
          |  20.83  |  79.17  |
----------+---------+---------+
Total        13171      4829     18000

Lift Chart for all models:

On the lift chart, it can be observed that all the models except the Ensemble model perform almost identically in every decile. The selection of the model for the final prediction therefore required a trade-off between the lift chart result and the % sensitivity and % misclassification rate results seen in the comparison table.

Eventually, the Decision Tree was selected as the final model for scoring, as it showed better % sensitivity and % misclassification rate than the other models.

3.0 Scoring

Scoring generates and manages the predicted values from the trained model using the formulae for prediction and assessment.
For this problem, it was applied to a new "holdout" sample data set to predict whether each record would be a 2G or a 3G customer. To score this data set with the selected decision tree model, a Score node was attached to the Tree node, and an Input Data Source node containing the "holdout" sample was attached to the Score node. The Score node was set to perform the action 'Apply training data score code to score data set' when the path was run. An Insight node was then attached to the Score node and run to obtain the results, which were saved in the selected library and opened with 'Analyst' in SAS to get the table of predicted 2G and 3G customers, which is as follows:

The FREQ Procedure
Table of CUSTOMER_TYPE by I_CUSTOMER_TYPE

CUSTOMER_TYPE(CUSTOMER_TYPE)    I_CUSTOMER_TYPE(Into: CUSTOMER_TYPE)

Frequency |
Row Pct   | 2G      | 3G      |  Total
----------+---------+---------+
2G        |  12546  |   2454  |  15000
          |  83.64  |  16.36  |
----------+---------+---------+
3G        |    625  |   2375  |   3000
          |  20.83  |  79.17  |
----------+---------+---------+
Total        13171      4829     18000

The file opened in SAS Analyst was also saved as a delimited file containing only two fields: SERIAL_NUMBER (the ID field) and I_CUSTOMER_TYPE (the prediction for the target field 'Customer Type' (2G/3G) from the holdout sample). All other fields were deleted.

4.0 Conclusion

Based on the above findings, we conclude that the Decision Tree with Gini 3 Way as the purity measure is the best model, with consistent results. Although we had to make a few trade-offs with respect to sensitivity and misclassification, the number of current or prospective 3G customers predicted, 1423, was optimal compared with the other models. The SAS Singapore Data Mining Competition gave us a feel for real-world data and an opportunity to work with it. This experience has helped to bridge the gap between theory and practice.
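The row percentages in the FREQ table in Section 3.0 can be recomputed from its four cell counts. The sketch below does this in Python, assuming 3G is treated as the positive class; the function name `confusion_rates` is hypothetical. Note that the misclassification rate here is computed over all 18000 records in that table, so it is not directly comparable with the rates reported for the Training, Validation and Testing partitions.

```python
def confusion_rates(tp, fn, fp, tn):
    """Percent rates from a 2x2 confusion table, with 3G as the positive class."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),          # 3G correctly predicted as 3G
        "false_negative_rate": 100.0 * fn / (tp + fn),  # 3G wrongly predicted as 2G
        "specificity": 100.0 * tn / (tn + fp),          # 2G correctly predicted as 2G
        "misclassification": 100.0 * (fp + fn) / total,
    }

# Cell counts taken from the FREQ table: rows are actual, columns are predicted.
rates = confusion_rates(tp=2375, fn=625, fp=2454, tn=12546)
for name, value in rates.items():
    print(f"{name}: {value:.2f}%")
```

The sensitivity (79.17%), false negative rate (20.83%) and specificity (83.64%) match the row percentages printed by the FREQ procedure.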
Financial Interpretations

Assumptions:
Cost of sending coupons per person = $1
Total number of customers = 6000
Revenue per product purchased by the customer = $10

Cost of mailing all 6000 customers = 6000 * $1 = $6000

If coupons are sent to all customers and only 1000 of them respond, the profit is:

Profit = 1000 * $10 - $6000 = $4000

If instead the Decision Tree model is used, which predicts 1423 potential 3G customers, and the model is assumed to be 100% accurate in capturing the actual 1000 responders, the mailing cost for 1423 customers is $1423, and the profit is:

Profit = 1000 * $10 - $1423 = $8577

This is for a data set of 6000 customers. For a data set with a million customer records, the profit would be correspondingly larger.

5.0 References

1. www.ats.ucla.edu/stat/sas/topics/logistic_regression.htm
2. www.indiana.edu/~statmath/stat/all/cat/1b1.htm
3. luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html
4. www.sas.com/technologies/analytics/datamining/miner/dec_trees.html
5. www.sas.com/offices/asiapacific/sp/training/courses/dmdt.html
6. citeseer.ist.psu.edu/36580.htm
7. www.sas.com/technologies/analytics/datamining/miner/neuralnet.html
8. www.sas.com/offices/asiapacific/sp/training/courses/dmnn.html
9. dimacs.rutgers.edu/Workshops/AdverseEvent/slides/stultz.ppt

6.0 Appendix

Appendix 1

Appendix 2: Variables selected manually

Name                      Model Role  Measurement  Type  Format   Informat  Variable Label
SERIAL_NUMBER             id          nominal      char  $4.00    $4.00     SERIAL_NUMBER
AGE                       input       interval     num   BEST12.  12        AGE
AVG_BILL_AMT              input       interval     num   BEST12.  12        AVG_BILL_AMT
AVG_BILL_VOICED           input       interval     num   BEST12.  12        AVG_BILL_VOICED
AVG_CALL                  input       interval     num   BEST12.  12        AVG_CALL
AVG_M2M_CALL_RATIO        input       interval     num   BEST12.  12        AVG_M2M_CALL_RATIO
AVG_MINS                  input       interval     num   BEST12.  12        AVG_MINS
AVG_MINS_IB               input       interval     num   BEST12.  12        AVG_MINS_IB
AVG_MINS_MOB              input       interval     num   BEST12.  12        AVG_MINS_MOB
AVG_NO_CALLED             input       interval     num   BEST12.  12        AVG_NO_CALLED
AVG_OP_MINS_RATIO         input       interval     num   BEST12.  12        AVG_OP_MINS_RATIO
AVG_PK_CALL_RATIO         input       interval     num   BEST12.  12        AVG_PK_CALL_RATIO
AVG_T1_CALL_CON           input       interval     num   BEST12.  12        AVG_T1_CALL_CON
AVG_VAS_GAMES             input       interval     num   BEST12.  12        AVG_VAS_GAMES
AVG_VAS_QG                input       interval     num   BEST12.  12        AVG_VAS_QG
AVG_VAS_QP                input       interval     num   BEST12.  12        AVG_VAS_QP
AVG_VAS_QTUNE             input       interval     num   BEST12.  12        AVG_VAS_QTUNE
AVG_VAS_SMS               input       interval     num   BEST12.  12        AVG_VAS_SMS
AVG_VAS_XP                input       interval     num   BEST12.  12        AVG_VAS_XP
BLACK_LIST_FLAG           input       binary       num   BEST12.  12        BLACK_LIST_FLAG
COBRAND_CARD_FLAG         input       binary       num   BEST12.  12        COBRAND_CARD_FLAG
CUSTOMER_CLASS            input       ordinal      num   BEST12.  12        CUSTOMER_CLASS
DAYS_TO_CONTRACT_EXPIRY   input       interval     num   BEST12.  12        DAYS_TO_CONTRACT_EXPIRY
GENDER                    input       binary       char  $1.00    $1.00     GENDER
HIGHEND_PROGRAM_FLAG      input       binary       num   BEST12.  12        HIGHEND_PROGRAM_FLAG
HS_AGE                    input       interval     num   BEST12.  12        HS_AGE
HS_MODEL                  input       nominal      num   BEST12.  12        HS_MODEL
LINE_TENURE               input       interval     num   BEST12.  12        LINE_TENURE
LOYALTY_POINTS_USAGE      input       interval     num   BEST12.  12        LOYALTY_POINTS_USAGE
LST_RETENTION_CAMP        input       interval     num   BEST12.  12        LST_RETENTION_CAMP
LUCKY_NO_FLAG             input       binary       num   BEST12.  12        LUCKY_NO_FLAG
MARITAL_STATUS            input       binary       char  $1.00    $1.00     MARITAL_STATUS
NATIONALITY               input       nominal      num   BEST12.  12        NATIONALITY
NUM_TEL                   input       interval     num   BEST12.  12        NUM_TEL
OCCUP_CD                  input       nominal      char  $4.00    $4.00     OCCUP_CD
PAY_METD                  input       nominal      char  $2.00    $2.00     PAY_METD
STD_CALL                  input       interval     num   BEST12.  12        STD_CALL
STD_MINS                  input       interval     num   BEST12.  12        STD_MINS
STD_MINS_INTT1            input       interval     num   BEST12.  12        STD_MINS_INTT1
STD_MINS_INTT2            input       interval     num   BEST12.  12        STD_MINS_INTT2
STD_MINS_INTT3            input       interval     num   BEST12.  12        STD_MINS_INTT3
STD_NO_CALLED             input       interval     num   BEST12.  12        STD_NO_CALLED
STD_OP_CALL_RATIO         input       interval     num   BEST12.  12        STD_OP_CALL_RATIO
STD_T1_MINS_CON           input       interval     num   BEST12.  12        STD_T1_MINS_CON
STD_VAS_AR                input       interval     num   BEST12.  12        STD_VAS_AR
STD_VAS_GAMES             input       interval     num   BEST12.  12        STD_VAS_GAMES
STD_VAS_SMS               input       interval     num   BEST12.  12        STD_VAS_SMS
SUBPLAN                   input       interval     num   BEST12.  12        SUBPLAN
TOP1_INT_CD               input       nominal      char  $4.00    $4.00     TOP1_INT_CD
TOP2_INT_CD               input       nominal      char  $4.00    $4.00     TOP2_INT_CD
TOP3_INT_CD               input       nominal      char  $4.00    $4.00     TOP3_INT_CD
TOT_DELINQ_DAYS           input       interval     num   BEST12.  12        TOT_DELINQ_DAYS
TOT_DIS_1900              input       interval     num   BEST12.  12        TOT_DIS_1900
TOT_PAST_REVPAY           input       ordinal      num   BEST12.  12        TOT_PAST_REVPAY
TOT_PAST_TOS              input       interval     num   BEST12.  12        TOT_PAST_TOS
VAS_AR_FLAG               input       binary       num   BEST12.  12        VAS_AR_FLAG
VAS_CND_FLAG              input       binary       num   BEST12.  12        VAS_CND_FLAG
VAS_GPRS_FLAG             input       binary       num   BEST12.  12        VAS_GPRS_FLAG
VAS_IB_FLAG               input       binary       num   BEST12.  12        VAS_IB_FLAG
VAS_IEM_FLAG              input       binary       num   BEST12.  12        VAS_IEM_FLAG
VAS_VM_FLAG               input       binary       num   BEST12.  12        VAS_VM_FLAG