PAKDD 2006 DATA MINING COMPETITION
OPEN CATEGORY
GROUP – STL6
CHARLES MCCANN VIVEK SASHIDHARAN FATHIMA SHARMILA SUHAIB THOMSON THAYIL
01 MARCH 2006

A. Approach and understanding of the problem

We began our investigation by determining what a 3G network can do and what is meant by the term 3G. Although we found various definitions, the following seemed to best describe the problem being presented:

3G (or 3-G) is short for third-generation mobile telephone technology. The services associated with 3G provide the ability to transfer both voice data (a telephone call) and non-voice data (such as downloading information, exchanging email, and instant messaging). - Wikipedia

From this definition we began to investigate the factors that we believed to be driving the model. Because the problem involves an Asian telco operator that would like to identify customers who are likely to switch to its 3G network, we researched the factors involved in the overall 3G model specifically as it applies to the Asian market. From our investigation we believed that the following have a significant impact on the model:

1) Current use of features that are directly enhanced by 3G. For example, we believe that individuals who currently surf the web on their phone are more likely to move to a 3G network, because performance on a 3G network will triple.
2) Individuals who are approaching the end of their contract life cycle are more likely to switch, because by signing a new contract they can offset the cost of a new handset capable of supporting 3G features.
3) Individuals who pay high phone bills are less concerned with the switching cost.

We then applied this basic understanding to determine whether those variables added information to the model.

B. Full technical details of algorithm(s) used

Software
SAS Enterprise Miner, Release 4.3

Sampling Algorithm
The sampling algorithm used a stratified method to sample the given data so that both values of the binary target variable were equally represented.

Data Partition Algorithm
A simple random data partition algorithm split the dataset into 60% training, 30% validation and 10% test, with a random seed of 12345.

Decision Tree Algorithm
A decision tree with Gini reduction as the splitting criterion and CART as the pruning algorithm was used. The tree had a maximum depth of 8 and a minimum of 18 observations per leaf, and it performed only binary splits.

Replacement Algorithm
For interval variables, tree imputation algorithms determined missing values. Similarly, tree imputation algorithms and a default constant of “U” were used for missing values in categorical variables.

Transformation Algorithm
Continuous variables were transformed using a discretizing (binning) algorithm.

Artificial Neural Network Algorithm
The predictive model was built using a feed-forward neural network with a Multilayer Perceptron (MLP) architecture and a single hidden layer. Misclassification rate was used as the model selection criterion.

C. Details of the classification model that was produced

Choosing the training set
The provided data set contained an unbalanced proportion of the target categorical variable “Customer_Type”. To better predict the target variable, a sampling algorithm created a dataset containing equal proportions of the target classes. To train the model, this dataset was further split by the data partition algorithm.

Variable Selection
A significant problem with the data set was the large number of variables that would affect our model building process. To help determine which variables to use, a decision tree algorithm selected the significant variables that would be used for building the model.
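As an illustration of the Gini-reduction splitting criterion named in Section B, the following is a minimal sketch in plain Python of how a single binary split on one variable could be scored. It is not the SAS Enterprise Miner implementation; the function names, the toy data and the class labels "2G" and "3G" are hypothetical, and only the minimum leaf size of 18 is taken from the text. Depth limits and CART pruning are not shown.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_reduction(values, labels, threshold):
    """Impurity reduction obtained by the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


def best_split(values, labels, min_leaf=18):
    """Return (reduction, threshold) of the best split for one variable.

    Candidate thresholds that would leave fewer than `min_leaf`
    observations in either child are skipped. Returns (0.0, None)
    when no admissible split improves on the parent node.
    """
    best = (0.0, None)
    for t in sorted(set(values))[:-1]:
        n_left = sum(1 for x in values if x <= t)
        if n_left < min_leaf or len(values) - n_left < min_leaf:
            continue
        gain = gini_reduction(values, labels, t)
        if gain > best[0]:
            best = (gain, t)
    return best


# Hypothetical toy data: 20 customers per class, perfectly separated.
usage = [1] * 20 + [5] * 20
customer_type = ["2G"] * 20 + ["3G"] * 20
reduction, threshold = best_split(usage, customer_type)
```

A tree-based variable selection step repeats this search over every candidate variable at every node; variables that never produce a worthwhile reduction are the ones dropped.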
Figure 1 depicts the variables that were selected by the decision tree.

Figure 1: Selected Variables

Preparing the data
Missing values were replaced using the replacement algorithm, and continuous values were transformed (binned) using the transformation algorithm.

Building the model
An Artificial Neural Network algorithm was used to build a model that predicts which current customers are likely to change from a 2G network to a 3G network.

D. Discussion on insights gained from the model

From our initial investigation of the problem domain, we believed that the following variables would have an impact on the overall model:

- WAP utilization variables
- Instant messaging variables
- Online games variables
- Chat variables
- Web-surfing variables
- Contract expiration variable

However, after trying various variable selection methods, we found that the decision tree algorithm we chose yielded the variables that led to the “best” overall model. This model retained some of the variables we expected: game variables, contract expiration date, and bill amount. The form of the information was somewhat different from what we expected. Examples are the three variables that appeared under the “Standard Deviation” category. These variables indicate to us that the consistency of usage over the last six months, not the level of usage, was an important factor in the model.

From our classification matrices we observe a false positive prediction rate between 18% and 21% (refer to Figures 2-4). We also observed that our sensitivity was between 76% and 78% (refer to Figures 2-4). This indicates that there are other factors affecting our prediction which do not currently appear in the model.

From an explanatory standpoint, because we used a neural network to predict the classification, we are unable to produce any detailed explanation of the rules used to build the model.
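For reference, the two rates quoted in the discussion can be derived from a 2x2 classification matrix as sketched below. The report does not state which definition of "false positive prediction rate" was used; this sketch assumes FP / (FP + TN). The function name and the counts are hypothetical placeholders and are not the values shown in Figures 2-4.

```python
def rates_from_matrix(tp, fn, fp, tn):
    """Summary rates from the four cells of a binary classification matrix.

    tp: actual 3G predicted 3G    fn: actual 3G predicted 2G
    fp: actual 2G predicted 3G    tn: actual 2G predicted 2G
    """
    total = tp + fn + fp + tn
    return {
        # Share of actual 3G customers that the model identifies.
        "sensitivity": tp / (tp + fn),
        # Share of actual 2G customers wrongly flagged as 3G (assumed definition).
        "false_positive_rate": fp / (fp + tn),
        # Share of all customers assigned to the wrong class.
        "misclassification_rate": (fp + fn) / total,
    }


# Hypothetical counts for a balanced sample of 200 customers.
rates = rates_from_matrix(tp=77, fn=23, fp=20, tn=80)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Computing the same rates separately on the training, validation and test partitions shows whether the model generalizes; similar values across the three partitions suggest it is not overfitting the training data.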
Figure 2: Training Classification Matrix

Figure 3: Validation Classification Matrix

Figure 4: Testing Classification Matrix