pakdd 2006 data mining competition

PAKDD 2006 DATA MINING COMPETITION
OPEN CATEGORY
GROUP – STL6
CHARLES MCCANN
VIVEK SASHIDHARAN
FATHIMA SHARMILA SUHAIB
THOMSON THAYIL
01 MARCH 2006
A. Approach and understanding of the problem
We began our investigation by first determining what a 3G network can do and what
the term 3G means. Although we found various definitions, the following seemed to best
describe the problem being presented.
3G (or 3-G) is short for third-generation mobile telephone technology. The services
associated with 3G provide the ability to transfer both voice data (a telephone call) and
non-voice data (such as downloading information, exchanging email, and instant
messaging). - Wikipedia
From this definition we began to investigate the overall factors that we believed to
be driving the model. Because this problem involves an Asian telco operator that would
like to identify customers who are likely to switch to its 3G network, we researched
the factors involved in the overall 3G model, specifically as they apply to the Asian
market.
From our investigation, we believed that the following factors have a significant
impact on the model:
1) Current use of features that are directly enhanced by 3G. For example, we
believe that individuals who currently surf the web on their phones will be more
likely to move to a 3G network, because performance on a 3G network will
triple.
2) Individuals who are approaching the end of their contract life cycle are more
likely to switch, because by signing a new contract they can offset the cost of a
new handset that is capable of supporting 3G features.
3) Individuals who pay high phone bills are less concerned with the switching cost.
We then used this basic understanding to check whether the corresponding variables
added information to the model.
B. Full technical details of algorithm(s) used
Software
SAS Enterprise Miner, Release 4.3
Sampling Algorithm
The sampling algorithm used a stratified method to draw a sample containing an equal
number of records for each value of the binary target variable.
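The balancing step can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAS Enterprise Miner implementation: the function name balanced_sample, the seed, the toy records and the class labels "2G" and "3G" are all assumptions made for the example.

```python
import random
from collections import defaultdict

def balanced_sample(rows, target_key, seed=0):
    """Downsample every class to the size of the rarest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)
    # Every class contributes as many records as the smallest class has.
    n = min(len(group) for group in by_class.values())
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], n))
    rng.shuffle(sample)
    return sample

# Toy data (assumed): 3 "3G" customers among 12 records.
rows = [{"id": i, "Customer_Type": "3G" if i % 4 == 0 else "2G"} for i in range(12)]
balanced = balanced_sample(rows, "Customer_Type")
```

Downsampling the majority class is one common way to reach equal class counts; the report does not say whether the tool downsampled or used another scheme.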
Data Partition Algorithm
A simple random data partition algorithm split the dataset into 60% training, 30%
validation and 10% test with a random seed of 12345.
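A minimal sketch of a 60/30/10 simple random partition follows. The function name partition is hypothetical, and although the seed value 12345 is taken from the text, Python's random number generator will not reproduce the exact split produced by SAS Enterprise Miner.

```python
import random

def partition(rows, fractions=(0.6, 0.3, 0.1), seed=12345):
    """Shuffle the rows once, then cut them into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_valid = int(round(fractions[1] * len(shuffled)))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# Toy input (assumed): 1000 record identifiers.
train, valid, test = partition(range(1000))
```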
Decision Tree Algorithm
A decision tree was used with Gini reduction as the splitting criterion and CART as the
pruning algorithm. The tree had a maximum depth of 8 and a minimum of 18 observations
per leaf. The tree performed only binary splits.
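To make the splitting criterion concrete, the sketch below computes the Gini reduction of one candidate binary split: the impurity of the parent node minus the size-weighted impurity of its two children. The function names and the toy labels are assumptions for illustration; depth limits, leaf-size limits and pruning are not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity of the parent minus the weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy split (assumed labels): a balanced parent split into two purer children.
parent = ["3G"] * 4 + ["2G"] * 4
left = ["3G"] * 3 + ["2G"] * 1
right = ["3G"] * 1 + ["2G"] * 3
reduction = gini_reduction(parent, left, right)
```

A tree grower would evaluate this quantity for every candidate split and keep the one with the largest reduction.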
Replacement Algorithm
For interval variables, tree imputation algorithms determined replacements for missing
values. Similarly, for categorical variables, tree imputation algorithms and a default
constant of “U” were used to replace missing values.
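Only the constant-replacement part is sketched here; tree imputation is not shown. The helper name fill_missing_categorical, the toy values, and the choice to treat None and empty strings as missing are assumptions for the example.

```python
def fill_missing_categorical(values, default="U"):
    """Replace missing categorical values (None or empty string) with a constant."""
    return [default if v is None or v == "" else v for v in values]

# Toy column (assumed values).
handset = ["A", None, "B", "", "A"]
filled = fill_missing_categorical(handset)
```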
Transformation Algorithm
Continuous variables were transformed using a discretizing or binning algorithm.
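The report does not state which binning method was applied. As an illustration only, the sketch below assumes equal-width binning; the function name, the number of bins and the toy bill amounts are hypothetical.

```python
def equal_width_bins(values, n_bins=4):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant column falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Toy bill amounts (assumed).
bills = [10.0, 25.0, 40.0, 55.0, 70.0, 90.0]
bins = equal_width_bins(bills)
```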
Artificial Neural Network Algorithm
The predictive model was built using a feed forward neural network. This neural network
had a Multilayer Perceptron (MLP) network architecture with a single hidden layer.
Misclassification rate was used as the model selection criterion.
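The sketch below shows the forward pass of a single-hidden-layer MLP and the misclassification rate used to compare candidate models. It is not the SAS Enterprise Miner network: the tanh hidden activation, the logistic output, the 0.5 threshold, the weights and the function names are all assumptions, and training is not shown.

```python
import math

def mlp_forward(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """One hidden layer (tanh, assumed) followed by a logistic output unit (assumed)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    z = sum(w * h for w, h in zip(out_weights, hidden)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))

def misclassification_rate(y_true, y_prob, threshold=0.5):
    """Fraction of cases whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) != bool(y))
    return wrong / len(y_true)

# Toy network (assumed weights): 2 inputs, 2 hidden units, 1 output.
p = mlp_forward([1.0, -1.0], [[0.5, 0.5], [1.0, 0.0]], [0.0, 0.0], [1.0, 0.0], 0.0)
rate = misclassification_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
```

With misclassification rate as the selection criterion, the candidate with the lowest rate on the validation partition would be the one kept.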
C. Details of the classification model that was produced
Choosing the training set
The provided data set contained unbalanced proportions of the values of the categorical
target variable “Customer_Type”. To better predict the target variable, a sampling
algorithm created a dataset containing equal proportions of each target value. To train
the model, the dataset was then split by the data partition algorithm.
Variable Selection
A significant problem with the data set was the large number of variables that could
affect our model building process. To narrow them down, a decision tree algorithm
selected the significant variables that would be used for building the model. Figure 1
depicts the variables that were selected by the decision tree.
Figure 1: Selected Variables
Preparing the data
Missing values were replaced using the replacement algorithm, and continuous variables
were discretized using the transformation algorithm.
Building the model
An Artificial Neural Network algorithm was used to build a model to predict which
current customers are likely to change from a 2G network to a 3G network.
D. Discussion on insights gained from the model
From our initial investigation about the problem domain at hand, we believed that
the following variables would have an impact on the overall model:
WAP utilization variables
Instant messaging variables
Online games variables
Chat variables
Web-surfing variables
Contract expiration variable
After trying various variable selection methods, we found that the decision tree
algorithm we chose yielded the variables that led to the “best” overall model. This
model retained some of the variables we expected: game variables, contract expiration
date, and bill amount. The form of the information, however, was somewhat different
from our expectations. For example, three of the selected variables appeared under the
“Standard Deviation” category. These variables indicate to us that the consistency of
usage over the last six months, not the level of usage, was an important factor in the
model.
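The difference between level and consistency of usage can be illustrated with two made-up customers who have the same six-month average but very different standard deviations. The usage figures are invented for the example, and the population standard deviation is an assumption; the report does not say how the data set's standard deviation variables were computed.

```python
from statistics import mean, pstdev

# Six months of usage for two hypothetical customers.
steady = [100, 102, 98, 101, 99, 100]
bursty = [10, 250, 0, 190, 20, 130]

steady_level, steady_spread = mean(steady), pstdev(steady)
bursty_level, bursty_spread = mean(bursty), pstdev(bursty)
```

Both customers have the same level of usage, so a mean-based variable cannot separate them, while a standard deviation variable can.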
From our classification matrices, we observe a false positive rate between 18% and
21% and a sensitivity between 76% and 78% (refer to Figures 2-4). This suggests that
there are other factors affecting our predictions which do not currently appear in the
model.
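For reference, the sketch below shows how these two rates can be derived from the four cells of a classification matrix. The counts are invented and are not taken from Figures 2-4. The false positive rate is assumed here to mean FP / (FP + TN); the report does not state which definition it used.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the model identifies: TP / (TP + FN)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Share of actual negatives that are flagged as positive: FP / (FP + TN)."""
    return fp / (fp + tn)

# Hypothetical counts for a balanced sample of 200 customers.
tp, fn, fp, tn = 77, 23, 20, 80
sens = sensitivity(tp, fn)
fpr = false_positive_rate(fp, tn)
```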
From an explanatory standpoint, because we used a neural network to predict the
classification, we are unable to produce any detailed explanation of the rules the
model uses.
Figure 2: Training Classification Matrix
Figure 3: Validation Classification Matrix
Figure 4: Testing Classification Matrix