PAKDD 2006
Data Mining Competition
Submitted By: Tulsa Group #4
Abhay Barapatre
Soumitra Rayarikar
Abhijit Sadhu
01/03/2006
Stillwater, OK
Executive Summary
Following the successful launch of a third generation (3G) telecommunications network,
an Asian telco operator wants to identify the existing customers most likely to switch
to its 3G network. The data set contains information on various attributes of existing
customers, such as age, sex, subscription plan, and model of mobile handset used.
To help the telecom company identify these probable customers, we built predictive
models using several modeling techniques: Logistic Regression (LR), Decision Tree
(DT) analysis, Artificial Neural Networks (ANN), and their combinations (refer to the
report for details on these techniques). The models were built from the variables in the
data set describing existing customer usage and demography. Before building the models
we studied the data and found a number of redundant and correlated variables, which we
eliminated before the model building process began. However, a few variables seemed
especially important for predicting whether a customer would subscribe to the 3G
network. Accessing and using 3G network services requires high-end mobile phones
with the latest technological features, such as high data transfer rates, multimedia
capability, and software platforms like Java. Based on this knowledge we assumed that
customers shifting from 2G to 3G are very likely to use high-end handset models that are
relatively new in the market, making these attributes important factors in deciding
whether a customer will use the 3G network. Thus, a few variables such as handset age
and handset model were forced into all the models, irrespective of whether the models
selected them statistically. After choosing the important variables and rejecting the
redundant ones, we used a total of 18,000 customer records to build and test the models.
We started by building simple models, Logistic Regression and various Decision Trees,
to predict which customers were likely to shift to 3G. Since these models did not perform
satisfactorily when tested, we moved on to more complex methods: different types of
Artificial Neural Networks built on top of DT or Variable Selection methods. That is,
after choosing the variables manually, we fed them to the DT or Variable Selection step
to trim the number of variables further before building the ANN model. This helped us
keep only the variables that really affected the probability of a customer making the
shift. Finally, we combined the ANN models into an Ensemble model. The details of
some of the best performing models are shown in the table below.
Table 1: Comparison of various Model Performances (all values in %)

                 Misclassification    Sensitivity          Specificity          Accuracy
Model            Train     Valid      Train     Valid      Train     Valid      Train     Valid
LR model         27.30     26.85      59.86     62.38      85.66     83.64      72.70     73.15
Gini 3w          15.59     19.56      75.51     73.31      93.30     87.86      84.41     80.44
ANN DT 3w gini   17.51     19.55      83.24     81.52      81.72     80.53      82.49     80.45
ANN VS Chi sq    18.48     19.80      78.84     79.48      84.22     83.18      81.52     80.20
Ensemble         17.00     19.00      80.90     79.32      86.05     82.73      83.00     81.00
From the table above we see that the Ensemble model outperforms all other models: it
has a relatively low misclassification rate together with reasonably high sensitivity and
specificity values.
Table 2: Relation between Sensitivity and Total 3G Predictions

Model            Sensitivity (%)   Correctly       Actual 3G         Total 3G Predictions
                                   Predicted 3G    Customers (Given) based on Model
ANN DT 3w gini   83.24             832             1000              1949
Ensemble         80.90             810             1000              1664

[Fig 1: Bar chart comparing Total 3G Predictions (1949, 1748, 1664) against
Sensitivity in percentage (83.24, 78.84, 80.90) for the models ANN DT 3w gini,
ANN VS Chi sq and Ensemble]

Fig 1: Graph showing the trade-off between the sensitivity of a model and the total
number of 3G customers predicted
The graph shows a trade-off between the total number of predicted 3G customers and
the number of 3G customers correctly predicted: as sensitivity increases, the total
number of 3G predictions also increases. To predict a larger number of probable
customers correctly, we would therefore have to target a larger customer base, which in
turn leads to higher costs. Thus, even though the sensitivity of the ANN DT 3w gini
model is higher than that of the Ensemble model, we select the latter. As seen in Figure 1
and Table 2, the ANN Gini 3-way model would require targeting an extra 285 people in
order to gain only 22 more customers. This is not justified, because the cost of targeting
the extra people might far exceed the profit obtained from the few extra customers
gained. We therefore selected the Ensemble model, which offers a good balance between
sensitivity and the total number of predicted 3G customers.
The lift chart (shown in Figure 9) further validates our decision. The lift chart plots
people in deciles against the customers shifting to 3G network services. For example, if
we pick the top 50% of the people with the highest propensity based on the Ensemble
model, the percentage of 3G customers is about 83%, whereas for the same 50% the
response rate is 81% for the ANN using the Gini 3-way method.
Conclusion
From the Ensemble model that was built, it can be seen that handset model and handset
age are the most important variables. Thus, as stated earlier, the company should focus
on customers with relatively new handsets that have the latest technologies. Apart from
these two variables, several others, such as subscription plan, average games utilization
in the last six months, and number of days to contract expiry (refer to Figure 5 for the
full list of variables), have been selected. Management should focus on these variables
in order to increase sales of 3G network services.
Model Building: Approach and Understanding
Introduction
An Asian Telco is planning to launch a third generation (3G) telecommunications
network. They want to target the probable customers using the customer data set
containing information about customers’ usage and demography. This data set contains
the information about various attributes of existing customers such as age, sex,
subscription plan, model of mobile handset used etc.
To determine the probable customers we built predictive models based on various
modeling techniques: Logistic Regression, Decision Tree analysis, Artificial Neural
Networks, and Ensemble models. (For a brief note on these techniques, what they are
and how they work, refer to Modeling Techniques.)
Data Exploration
An original sample dataset of 20,000 2G network customers and 4,000 3G network
customers has been provided with 251 data fields (Variables). The target categorical
variable is “Customer_Type” (2G/3G), which is a binary variable as it takes the value 2G
or 3G. A 3G customer is defined as a customer who has a 3G Subscriber Identity Module
(SIM) card and is currently using a 3G network compatible mobile phone.
Three-quarters of the dataset (15K 2G, 3K 3G) will have the target field available and is
meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) will be
made available with the target field missing and is meant to be used for prediction.
Tools Used
SAS Enterprise Miner 4.3
Modeling Techniques Used
Logistic Regression
Decision Tree
Artificial Neural Networks
Ensemble
Details of Algorithm Used
Data Modification
In the data set we identified 150 variables that are either redundant or have no impact on
the target variable. We did this after careful analysis of the data. Most of the data fields
carry average, total, and standard deviation values for the same parameter. We kept the
average values of these parameters and manually removed the standard deviation and
total values for parameters such as payment method, average bill, number of calls, etc.
This ensured that no dummy variables or correlated variables were present during the
model building process. Some nominal variables were supplied as interval variables in
the dataset provided; we manually changed them to nominal variables in the initial input
node of the model.
Replacement of Missing Values
The Replacement node belongs to the Modify category of the SAS SEMMA (Sample,
Explore, Modify, Model, Assess) data mining process. We used the Replacement node to
replace missing values and to trim specified non-missing values in the data sets used for
data mining. Many variables had a high proportion of missing values; we filled them
using tree imputation techniques for both class and interval scale variables.
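The idea behind tree imputation is that a missing value is predicted from the other fields of the same record. A minimal sketch of that idea, using a depth-1 "tree" that splits on one field and fills the missing interval value with its branch mean (field names and values here are illustrative, not taken from the competition data):

```python
# Minimal sketch of tree-style imputation: a missing interval value is
# predicted from another field, here via a depth-1 surrogate tree that
# splits on HS_MODEL and fills HS_AGE with the mean of its branch.
def tree_impute(records, split_key, target_key):
    # Learn the mean target value for each branch of the split variable.
    sums, counts = {}, {}
    for r in records:
        v = r[target_key]
        if v is not None:
            k = r[split_key]
            sums[k] = sums.get(k, 0.0) + v
            counts[k] = counts.get(k, 0) + 1
    overall = sum(sums.values()) / max(sum(counts.values()), 1)
    means = {k: sums[k] / counts[k] for k in sums}
    # Fill each missing value with its branch mean (overall mean as fallback).
    filled = []
    for r in records:
        r = dict(r)
        if r[target_key] is None:
            r[target_key] = means.get(r[split_key], overall)
        filled.append(r)
    return filled

customers = [
    {"HS_MODEL": "A", "HS_AGE": 2.0},
    {"HS_MODEL": "A", "HS_AGE": 4.0},
    {"HS_MODEL": "B", "HS_AGE": 10.0},
    {"HS_MODEL": "A", "HS_AGE": None},  # imputed as mean of model A
]
filled = tree_impute(customers, "HS_MODEL", "HS_AGE")
```

The actual node grows a full decision tree per imputed variable; the branch-mean rule above is the single-split special case of the same mechanism.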
Sampling
The Sampling node belongs to the Sample category of the SAS SEMMA (Sample,
Explore, Modify, Model, Assess) data mining process. The Sampling node is used to
extract a sample of an input data source. Sampling is done for extremely large databases because it
tremendously decreases model fitting time. As long as the sample is sufficiently
representative, patterns that appear in the data as a whole will be traceable in the sample.
Sampling also closes the gap between huge data sets and human limitations.
The original sample contains 20,000 2G network customers and 4,000 3G network
customers, so the dataset is biased towards 2G network customers. To remove this bias
we used the Sampling node to draw a sample of 6,000 records with equal numbers of 2G
and 3G network customers, using a seed of 5837. This balanced dataset was passed on
for model building.
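The balanced draw can be sketched as follows: take the same number of records from each class with a fixed seed so the sample is reproducible (shown here at a much smaller scale than the report's 6,000 records; the data is synthetic):

```python
import random

# Illustrative balanced sampling: draw equal numbers of 2G and 3G records
# with a fixed seed, mirroring the report's balanced 6,000-record sample.
def balanced_sample(records, target_key, n_per_class, seed):
    rng = random.Random(seed)
    sample = []
    for cls in ("2G", "3G"):
        pool = [r for r in records if r[target_key] == cls]
        sample.extend(rng.sample(pool, n_per_class))
    rng.shuffle(sample)
    return sample

data = [{"id": i, "Customer_Type": "2G"} for i in range(20)] + \
       [{"id": 100 + i, "Customer_Type": "3G"} for i in range(4)]
sample = balanced_sample(data, "Customer_Type", 3, seed=5837)
```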
Data Partition
After sampling the data, the data is usually partitioned before modeling. Use the Data
Partition node to partition input data into one of the following data sets:
Train
is used for preliminary model fitting. The analyst attempts to find the best
model weights using this data set.
Validation is used to assess the adequacy of the model in the Model Manager and in the
Assessment node. The validation data set is also used for model fine-tuning.
We split the data into 65% training, 25% validation and 10% testing.
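The 65/25/10 split above can be sketched as a seeded shuffle followed by slicing (the seed value is illustrative):

```python
import random

# Sketch of a 65% train / 25% validation / 10% test partition with a
# reproducible shuffle. Proportions are from the report; data is synthetic.
def partition(records, seed=5837):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.65)
    n_valid = int(n * 0.25)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = partition(list(range(1000)))
```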
Data Transformation
The Transform Variables node enables us to create new variables that are
transformations of existing variables in the dataset. Transformations are useful when we
want to improve the fit of a model to the data. For example, transformations can be used
to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality
in variables. After observing the distributions, we found that most of the variables were
not normally distributed, so we mostly used log and bucket transformations for these
variables. Some of the major variables transformed were HS_AGE,
DAYS_TO_CONTRACT_EXPIRY, AVG_AMT_PAID etc. (Refer Appendix I for some
of the transformations)
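The two transformations can be sketched briefly: a log transform compresses a long right tail, and a bucket transform bins a continuous value into a small number of ranges (the values below are made up; they only stand in for a skewed field such as AVG_AMT_PAID):

```python
import math

# Illustrative log and bucket transformations for a right-skewed variable.
amounts = [0.0, 5.0, 12.0, 80.0, 640.0]

# log1p handles zero values and compresses the long right tail.
log_amounts = [math.log1p(x) for x in amounts]

# A simple equal-width bucket transform over the observed range.
def bucket(x, lo, hi, n_buckets):
    if x >= hi:
        return n_buckets - 1
    width = (hi - lo) / n_buckets
    return int((x - lo) // width)

buckets = [bucket(x, 0.0, 640.0, 4) for x in amounts]
```

SAS Enterprise Miner's bucket transform chooses cut points for you; the equal-width rule here is just one simple choice of cut points.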
Variable Selection
The Variable Selection node belongs to the Explore category of the SAS SEMMA
(Sample, Explore, Modify, Model, Assess) data mining process. The given database has
251 variables, out of which we identified around 100 potential model inputs (independent
or explanatory variables) that can be used to predict the target (response variable). Using
the Variable Selection node we reduced the number of inputs by rejecting input variables
that are not related to the target. Although rejected variables are passed to subsequent
nodes in the process flow diagram, they are not used as model inputs by successor
modeling nodes. The Variable Selection node quickly identifies input variables that are
useful for predicting the target variable and assigns them input roles. We also selected
around 30 variables and forced them into the model by a manual selection process.
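The kind of screening such a node performs can be illustrated with the chi-square statistic for a binary input against the binary target: inputs associated with the target score high and are kept, independent ones score near zero and are rejected. A minimal sketch with illustrative counts:

```python
# Chi-square statistic for a 2x2 contingency table of one binary input
# against the binary 2G/3G target. Counts are illustrative.
def chi_square_2x2(table):
    # table[i][j]: observed count for input level i and target level j.
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Input strongly associated with the target -> large statistic (keep).
strong = chi_square_2x2([[90, 10], [20, 80]])
# Input independent of the target -> statistic near zero (reject).
weak = chi_square_2x2([[50, 50], [50, 50]])
```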
The 42 final variables selected after this node were as follows (sample).
HS_MODEL,
HS_AGE
SUBPLAN
DAYS_TO_CONTRACT_EXPIRY
AVG_VAS_GAMES
AVG_CALL_OB
AVG_MINS_INTRAN
LINE_TENURE
..
..
..
…
Screen Shot From SAS Enterprise Miner (Figure 2)
Logistic Regression
Since our target variable is a binary categorical variable, the first model we built was the
Logistic Regression model.
Logistic regression attempts to predict the probability that a categorical target will
acquire the event of interest as a function of one or more independent inputs.
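The model form is P(event) = 1 / (1 + exp(-(b0 + b1*x1 + ...))), with the coefficients fitted by maximizing the likelihood. A toy sketch, fitting one coefficient by gradient descent on synthetic data (the single input merely stands in for a predictor such as handset age; SAS fits the real model very differently under the hood):

```python
import math

# Toy logistic regression fitted by batch gradient descent on log-loss.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Synthetic pattern: low handset age -> class 1 (3G), high age -> class 0.
xs = [0.5, 1.0, 1.5, 4.0, 5.0, 6.0]
ys = [1, 1, 1, 0, 0, 0]
b0, b1 = fit_logistic(xs, ys)
p_new = sigmoid(b0 + b1 * 1.0)  # young handset: high predicted P(3G)
p_old = sigmoid(b0 + b1 * 5.0)  # old handset: low predicted P(3G)
```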
First we built the LR model using stepwise selection method and significance level of
5%. The output is shown as below:
Output for LR (Taken as Screenshot from SAS-Enterprise Miner 4.3) (Figure 3)
As we can see, the sensitivity was very low. We tried different combinations in the
logistic regression model, using the Forward Selection and Backward Elimination
methods, but there was no improvement in the misclassification, sensitivity, or accuracy
values. We also observed that the variable selection done by this model was poor. The
following are the results of the best LR model built.
                Misclassification Rate    Sensitivity            Specificity
                Training   Validation     Training   Validation  Training   Validation
Best LR model   27.30%     26.85%         59.86%     62.38%      85.66%     83.64%

We proceeded further to build the Decision Tree models.
Decision Tree Model
An empirical tree represents a segmentation of the data that is created by applying a
series of simple rules. Each rule assigns an observation to a segment based on the value
of one input. One rule is applied after another, resulting in a hierarchy of segments within
segments.
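The split rules are chosen by an impurity criterion such as the Gini index: for each candidate threshold on an input, the tree measures how mixed the two resulting segments are and keeps the purest split. A minimal sketch over one synthetic input (which could stand for handset age, with label 1 = 3G):

```python
# Sketch of the Gini impurity criterion used to grow a decision tree:
# pick the split threshold on one input with the lowest weighted impurity.
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(xs, ys):
    best_t, best_impurity = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if w < best_impurity:
            best_t, best_impurity = t, w
    return best_t, best_impurity

xs = [0.5, 1.0, 2.0, 5.0, 6.0, 7.0]
ys = [1, 1, 1, 0, 0, 0]
t, impurity = best_split(xs, ys)  # perfect split at x <= 2.0
```

The Chi-square and Entropy criteria compared in the table below differ only in the score used to rank candidate splits; the search procedure is the same.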
After reviewing the LR models we went on to build Decision tree models by selecting
various criteria for building the tree such as Chi square, Gini Index and Entropy selection
method. The following table gives the summary of various decision tree models.
Model        Misclassification    Sensitivity          Specificity          Accuracy
             Train     Valid      Train     Valid      Train     Valid      Train     Valid
Chi-Sq 2w    22.33     24.56      77.32     76.96      78.00     73.90      77.67     75.44
Chi-Sq 3w    21.26     23.39      81.12     81.20      76.44     71.77      78.74     76.61
Entropy 2w   23.67     24.94      73.30     73.10      79.26     77.09      76.33     75.05
Entropy 3w   21.79     24.33      74.45     73.53      81.84     77.89      78.21     75.67
Gini 2w      23.56     24.89      73.87     73.20      78.91     77.09      76.43     75.05
Gini 3w      21.59     23.67      79.87     79.84      76.99     72.67      78.41     76.33
It is observed that there is not much difference in the Misclassification rate, Sensitivity
and Specificity values of Training and validation data. Hence we can say that our Model
is stable and behaves in the same way for both training and validation data.
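All of the measures reported in these tables derive from the confusion matrix, with 3G as the positive class. A minimal computation (the counts below are illustrative, not from our models):

```python
# How misclassification rate, sensitivity, specificity and accuracy derive
# from a confusion matrix (tp/fn/fp/tn counts; 3G is the positive class).
def classification_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / total,
    }

m = classification_metrics(tp=80, fn=20, fp=25, tn=75)
```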
The variables selected in the Best Decision Tree Model are
Figure 4
The tree built on gives us the information that the people who have 3G network have
 Handset Model 10828, Age < 0.5 Months, with good games utilization feature
 Handset Model 10828, Age <5.5 months, less than 522 days to contract expiry
and outbound call facility.
 Handset Model 10829 with the entire internet based features.
But we felt that we should go ahead try for better values of Sensitivity and low values of
Misclassification rate.
Hence we decided to build the Artificial Neural Network Models.
Artificial Neural Network (ANN) Model
Artificial Neural Networks (ANN) store information as patterns and use those patterns
to identify the possible customers to target for the new product in the present problem.
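The family of models the ANN nodes fit is a feed-forward network: inputs are combined by hidden units through a nonlinear activation, and the output unit combines the hidden units into a posterior probability. A minimal sketch with one hidden layer and hand-picked weights (chosen so the network computes XOR, purely to show how hidden units let the model capture patterns no single linear rule can):

```python
import math

# A one-hidden-layer feed-forward network with fixed, hand-picked weights.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2):
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # ~ OR of the inputs
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # ~ AND of the inputs
    return sigmoid(20 * h1 - 20 * h2 - 10)  # ~ OR AND NOT AND = XOR

preds = [round(forward(x1, x2)) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

In practice the weights are of course learned from the training data by an optimization procedure rather than set by hand.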
We built many Artificial Neural Network models and selected the following two based
on their better sensitivity values:
Artificial Neural Network (ANN) Model on Gini Three way Split &
Artificial Neural Network (ANN) Model on Variable selection node
                 Misclassification Rate
                 Training   Validation   Sensitivity   Specificity   Seed Used
ANN DT 3w gini   17.51%     19.55%       83.24%        81.72%        52132
ANN VS Chi sq    18.48%     19.80%       78.84%        84.22%        95509
It is observed that there is not much difference in the Misclassification rate, Sensitivity
and Specificity values of Training and validation data. Hence we can say that our Model
is stable and behaves in the same way for both training and validation data.
The variable Selected from both the Models are.
Artificial Neural Network (ANN) Model on Gini Three way Split:
List of variables: (Screenshot of SAS Enterprise Guide) (Figure 5)
Artificial Neural Network (ANN) Model on Variable selection node
List of variables: (Screenshot of SAS Enterprise Guide) (Figure 6)
Ensemble
The Ensemble node also belongs to the Model category of the SAS SEMMA (Sample,
Explore, Modify, Model, and Assess) data mining process. The Ensemble node creates a
new model by averaging the posterior probabilities (for class targets) or the predicted
values (for interval targets) from multiple models that precede the Ensemble node in a
process flow diagram. The new model that the Ensemble node creates is then used to
score new data. The two models we used for Ensemble are
Artificial Neural Network (ANN) Model on Gini Three way Split &
Artificial Neural Network (ANN) Model on Variable selection node
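The averaging the Ensemble node performs is simple to state: for each customer, take the mean of the posterior probabilities produced by the component models, then classify against a cutoff. A sketch with made-up scores for four customers:

```python
# The Ensemble node averages posterior probabilities from its component
# models; the scores below are made up for illustration.
def ensemble_posteriors(probs_a, probs_b):
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]

def classify(probs, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probs]

ann_dt = [0.9, 0.4, 0.2, 0.7]  # P(3G) from ANN on Gini 3-way tree
ann_vs = [0.7, 0.8, 0.1, 0.5]  # P(3G) from ANN on variable selection
avg = ensemble_posteriors(ann_dt, ann_vs)
labels = classify(avg)
```

Note how the second customer is rejected by one model (0.4) and accepted by the other (0.8), and the average (0.6) lets the more confident model carry the decision.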
Screenshot of the Output from SAS Enterprise Miner (Figure 7)
Ensemble model results:

           Misclassification Rate
           Training   Validation   Sensitivity   Specificity
Ensemble   17%        19%          80.90%        85.55%
It is observed that there is not much difference in the Misclassification rate, Sensitivity
and Specificity values of Training and validation data. Hence we can say that our Model
is stable and behaves in the same way for both training and validation data.
Screenshot of the Output from SAS Enterprise Miner When Scoring was done on
target field (Figure 8)
From the output we can say that, on the holdout data, our model will predict 1664
customers as 3G network customers, with 80.9% sensitivity.
Lift Charts
(Figure 9)
This is a plot of the percentile of people in deciles versus the response rate of the data
sample for 3G network customers. If we pick the top 50% of the people with the highest
propensity based on the Ensemble model, the response rate is around 83%, whereas for
the same 50% the response rate is around 80% for the Artificial Neural Network by the
variable selection method and 81% for the Artificial Neural Network by the Gini method
for decision trees.
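The decile calculation behind the lift chart can be sketched as: rank customers by predicted propensity and measure the response rate in the top fraction, compared against the overall rate (the scores and labels below are synthetic):

```python
# Sketch of the decile/lift calculation: sort customers by predicted
# propensity, then measure the 3G response rate in the top fraction.
def response_rate_top_fraction(scores, labels, fraction):
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top = ranked[: int(len(ranked) * fraction)]
    return sum(y for _, y in top) / len(top)

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
top50 = response_rate_top_fraction(scores, labels, 0.5)
overall = sum(labels) / len(labels)
lift = top50 / overall  # how much better than targeting at random
```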
Appendix 1
[Before/after distribution plots of the transformed variables Days_to_Contract_Expiry,
Avg_Bill_Amount and Avg_Bill_Voiced]