Analysis for the Asian Telecommunications Provider
Table of Contents
Executive Summary ....................................................... 3
Parameters used in SAS EM ............................................... 5
Reviewing and Sampling .................................................. 5
Data Partitioning, Variable Selection, Replacement, and Transformation .. 5
Modeling ................................................................ 6-10
Conclusion .............................................................. 11
Diagram of Data Flow .................................................... 12
Executive Summary
Background
An Asian telecommunications provider recently launched the first phase of a third
generation (3G) mobile telecommunications network. The company now has both second
generation (2G) and 3G subscribers. The company would like to use existing customer usage and
demographic data to identify which 2G customers are most likely to upgrade or switch to the 3G
technology. The sample dataset provided to the consulting firm has 20,000 2G network users
and 4,000 3G network users with more than 200 data fields for each subscriber. Three quarters
of the database (18,000 records of 15,000 2G and 3,000 3G) have target variable values
available. These records should be used for training/validation/testing of models. The remaining
6,000 records presumably contain 5,000 2G and 1,000 3G customers. The target values are
missing for this part of the database. The goal is to accurately predict as many 3G customers in
the scoring database as possible.
Methodology
The process follows this sequence: data review and sampling, data partitioning, variable
selection, modeling, and model evaluation and selection. Model selection relied on
charts to compare specificity, sensitivity, and accuracy (equivalently, misclassification rate).
The original database was skewed toward 2G customers at the rate of 83.4% to 16.6% for
3G customers. To properly train the predictive models a sampling of the main database was
created with equal numbers of 2G and 3G customers. This allowed the models to learn both 2G
and 3G customer characteristics more effectively. This database was partitioned into training,
validation and testing sub-sets so that each model “checks” its learning and assumptions.
Each customer record contained over 250 variables. Several variable selection methods
were used to reduce the number of variables: manual variable selection, decision tree variable
selection, and SAS variable selection. After thorough evaluation of each variable selection
process, we selected the SAS variable selection tool because it provided better results with the
models for this dataset. The 250 variables were narrowed down to 31 for use with the models.
Three types of models were used to predict the 3G customers: Logistic Regression (LR),
Decision Trees (DT) and Artificial Neural Networks (ANN). A fourth ensemble model was also
used, which combines the predictions of the individual models. Decision Trees consistently
outperformed both Logistic Regression and Artificial Neural Networks in terms of "true
positive" prediction (sensitivity, as read from the ROC chart) and misclassification rate. Both
two-branch and multi-branch decision trees were evaluated and tested, with the three-branch
decision trees yielding the best results. The Ensemble model, based on two Decision Trees (Gini
and Entropy), provided better accuracy than any single decision tree.
Each model was evaluated using several criteria: the ROC chart (a measure of sensitivity
and specificity), Captured Response Rate and the Lift Chart. The DT model consistently
performed better than Logistic Regression and Artificial Neural Networks in terms of
misclassification rate and, more importantly, sensitivity. The Ensemble model comprising two
Decision Trees performed better on all of the measurements than the individual models did separately.
Findings
Consistent with the scope of the project, the Gini and Entropy Decision trees, combined
in an ensemble model, more accurately predicted 3G customers within the original training,
validation and testing data. ROC charts were used to evaluate sensitivity, misclassification rates
were used as estimates of overall error rates, and cumulative and non-cumulative captured
response charts provided general comparison of models’ performance.
Conclusion
We recommend the use of an Ensemble Model, which combines two Decision Tree
models (Gini and Entropy) to correctly predict which 2G customers will be most likely to
upgrade or switch to the 3G technology. The management team and the marketing department of
the Asian Telecommunication Company can use the developed model to proceed with cost-benefit
analysis. The final selection of a model depends on marketing budgets and management
choice. The next recommended step would be a project budgeting analysis based on the suggested
model's output, using fixed costs, variable costs, and profit per customer to find the break-even
point and profit/loss at different predicted 3G levels.
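The recommended break-even calculation can be sketched as follows. All cost and profit figures here are hypothetical placeholders, not client data, and `precision` (the fraction of contacted customers who actually take up 3G) would come from the model's classification matrix.

```python
from math import ceil

def break_even_contacts(fixed_cost, cost_per_contact, profit_per_convert, precision):
    """Smallest number of contacted customers at which expected campaign
    profit turns non-negative. `precision` is the (hypothetical) fraction of
    contacted customers who actually upgrade to 3G."""
    margin = precision * profit_per_convert - cost_per_contact
    if margin <= 0:
        return None  # the campaign can never break even at this precision
    return ceil(fixed_cost / margin)

# Hypothetical figures: $10,000 fixed cost, $5 per contact, $60 profit per
# converted customer, and a model precision of 20%.
contacts = break_even_contacts(10_000, 5.0, 60.0, 0.20)
```

Each contacted customer then contributes an expected margin of 0.20 × $60 − $5 = $7, so the campaign breaks even once roughly 1,429 customers have been contacted.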
Parameters used in the SAS EM
Seed ID: 12345
Number of observations used: 18,000 records (complete labeled dataset)
Number of dataset variables: 250
Partitioning: 60% - training, 30% - validation, 10% - test
SAS Enterprise Miner version 4.3
Modeling methods: Chi-Square 3-Node Decision Tree, Entropy Reduction 3-Node Decision
Tree, Gini Reduction 3-Node Decision Tree, Logistic Regression, Multi-Layer Perceptron
(MLP), Artificial Neural Network (ANN), Ensemble Model
Reviewing and Sampling the Dataset
The first order of business the group considered was the distribution of the target
variable: Customer_Type (2G or 3G). The distribution of 2G customer types compared to 3G
customer types was 15,000 records to 3,000 records in the training/testing dataset. The group
further reduced the number of records used by creating a balanced dataset consisting of 3,000 2G
customers and 3,000 3G customers. Had this procedure not been completed, the models would have
been biased towards predicting everybody as a 2G customer.
Data Partitioning, Variable Selection, Replacement, and Transformation
The group’s next step was to split the sample into 3 datasets: Training, Validation, and
Testing. The training dataset consisted of 60% of the data, validation dataset consisted of 30%
of the data, and the testing dataset consisted of the final 10% of the data. The group felt this split
would reduce the amount of overfitting and would provide a general idea of how well the models
being created were performing.
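The 60/30/10 split can be sketched as a simple shuffle-and-slice; this is an illustration of the partitioning logic, not the SAS EM Data Partition node, and the seed again mirrors the project parameter.

```python
import random

def partition(records, fractions=(0.6, 0.3, 0.1), seed=12345):
    """Shuffle the records and split them into training, validation, and
    testing subsets according to the given fractions."""
    random.seed(seed)
    shuffled = list(records)
    random.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# The balanced sample of 6,000 records splits into 3,600 / 1,800 / 600.
train, valid, test = partition(range(6000))
```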
The group tried three methods of reducing the 251 variables included in the
datasets being used to create the models: Manual Variable Selection, Decision Tree
Variable Selection, and the SAS Variable Selection tool. The SAS variable selection tool was
used to determine the variables used in the model because it provided the best results in the
initial models created to study the dataset, and provided the user with a workable number of
variables. Many of the variables were eliminated because they had a low correlation with the
Target Variable, Customer_Type. The final set of variables used in the modeling is shown in
Table One.
Table One: Variable Selection

Variable                 Variable
--------                 --------
AGE                      AVG_CALL_MOB
MARITAL_STATUS           AVG_CALL_INTRAN
CUSTOMER_CLASS           AVG_CALL_T1
LINE_TENURE              AVG_VAS_GAMES
HS_AGE                   AVG_VAS_SR
HS_MODEL                 AVG_PK_CALL_RATIO
LUCKY_NO_FLAG            AVG_OP_CALL_RATIO
LST_RETENTION_CAMP       AVG_M2M_MINS_RATIO
LOYALTY_POINTS_USAGE     STD_BILL_SMS
BLACK_LIST_FLAG          STD_T1_CALL_CON
VAS_CND_FLAG             G_PAY_METD
VAS_CNND_FLAG            G_PAY_METD_PREV
VAS_AR_FLAG              G_TOP1_INT_CD
AVG_DIS_1900             G_TOP2_INT_CD
AVG_USAGE_DAYS           G_TOP3_INT_CD
AVG_BILL_VOICED
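The correlation-based elimination described above can be sketched with a simple filter. This is a generic illustration, not the SAS Variable Selection node's actual algorithm (which also handles categorical inputs and R-square criteria); the threshold, toy values, and variable pairings are invented.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def select_variables(columns, target, threshold=0.1):
    """Keep variables whose absolute correlation with the binary target
    (Customer_Type coded 0 for 2G, 1 for 3G) meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Toy data: LINE_TENURE tracks the target, LUCKY_NO_FLAG is uncorrelated.
target = [0, 0, 1, 1]
columns = {"LINE_TENURE": [1, 2, 9, 10], "LUCKY_NO_FLAG": [1, 0, 1, 0]}
kept = select_variables(columns, target)
```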
Replacement was not performed on the dataset: most of the variables were missing less
than 5% of their values, and the only significant variable missing over 50% of its
values was excluded in the variable selection process.
Transformation was performed on the remaining variables in the early stages of
modeling. Testing showed that many of the binning and normalization techniques
did not yield improved results when modeled, so the group decided to forgo
transformation of the final set of variables.
Modeling
Three basic models were run: Logistic Regression, Artificial Neural Network, and
Decision Trees. The Logistic Regression and ANN models consistently performed worse than
the three decision tree models even as measures were taken to try to improve their
misclassification rates (predictive ability).
Various parameters were changed in order to
improve the accuracy of these models, for example, the number of hidden nodes in the ANN
models. Random seed numbers were also varied to see their effect on the results
and to make sure the ANN model was not trapped in a local minimum.
One of the analyses we did amongst the early testing of decision trees was to determine if
the maximum number of branches to a node affected our results substantially. From our testing
we were able to ascertain the maximum of three branches per node yielded the best results. Once
our group decided to pursue the maximum three branch route, we discovered the various
Decision Trees yielded similar results, and we were able to improve their predictive rates when
various parameters were manipulated. Through testing, the parameters used in the final model
(an Ensemble Model of the Entropy Reduction and Gini Reduction models) were finalized.
The parameters used in the decision trees are shown below in Table Two.
Table Two: Decision Tree Criteria
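The splitting criteria behind the Gini Reduction and Entropy Reduction trees can be sketched as follows. This is a minimal illustration of impurity-based splitting on one variable, not SAS EM's actual tree-growing algorithm, and the AGE values are invented.

```python
from math import log2

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy of a set of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels, impurity):
    """Find the threshold on one variable that minimizes the weighted
    impurity of the two resulting branches."""
    best_thr, best_imp = None, float("inf")
    for thr in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        if not left or not right:
            continue
        weighted = (len(left) * impurity(left)
                    + len(right) * impurity(right)) / len(labels)
        if weighted < best_imp:
            best_thr, best_imp = thr, weighted
    return best_thr, best_imp

# Toy data where AGE <= 30 perfectly separates 3G upgraders from 2G stayers.
ages = [20, 25, 30, 45, 50, 60]
labels = ["3G", "3G", "3G", "2G", "2G", "2G"]
thr, imp = best_split(ages, labels, gini)
```

On this toy data both criteria pick the same threshold; on real data Gini and entropy can disagree, which is precisely why the two trees make slightly different (and combinable) predictions.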
Table Three shows a comparison of the six main models and how well they predicted which
2G customers would switch to 3G. The predictions across the training, validation, and testing data are
fairly stable, indicating there was little overfitting of the models. The Decision Trees consistently
performed five to ten percent better in predicting the 3G customers than the Logistic Regression
and ANN models. The four models utilizing Decision Trees were consistently around the twenty
percent misclassification rate and they performed better in the ROC chart (specificity and
sensitivity). When creating the Ensemble model, the individual models were grouped together in
various combinations to determine if any benefit was achieved through their coupling or staging.
Based on these observations, our group used the combination of two Decision Tree models:
Entropy Reduction and Gini Reduction, to improve the results. Consequently, we found that
the combination of these two models provided a better overall result than each of the models
created individually in terms of high accuracy, sensitivity, specificity, and misclassification rate
across all three datasets.
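The coupling of the two trees can be sketched as posterior averaging, one common ensemble scheme and, to our understanding, the approach the SAS EM Ensemble node takes; the probabilities below are hypothetical.

```python
def ensemble_predict(p_gini, p_entropy, threshold=0.5):
    """Average the posterior probabilities of "3G" from the two trees,
    then apply a cutoff to classify each customer."""
    averaged = [(a + b) / 2 for a, b in zip(p_gini, p_entropy)]
    return ["3G" if p >= threshold else "2G" for p in averaged]

# Hypothetical posterior probabilities of "3G" from the Gini and Entropy trees.
predictions = ensemble_predict([0.8, 0.4, 0.55], [0.7, 0.3, 0.65])
```

Note the third customer: the Gini and Entropy trees individually straddle the 0.5 cutoff, and averaging resolves the disagreement, which is how the Ensemble can outperform either tree alone.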
Table Three: Model Comparison
Our group used multiple criteria to determine the "best model". The ROC chart
(shown in Table Four below) was the basis for determining which of the
models performed best.
Table Four: ROC Chart
When viewing this chart, it is important to remember the greater the slope of the curve
when going from left to right, the more sensitive the model. In our case, the individual Decision
Tree models, and the Ensemble models clearly outperform the Regression and Neural Network
models. If judging on this criterion solely, we would choose the Ensemble model as the best
performer.
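An ROC curve of the kind read off above can be computed by sweeping the classification threshold; this is a generic sketch with invented scores, not output from the actual models.

```python
def roc_points(scores, labels, positive="3G"):
    """Sweep the score threshold from high to low and record
    (1 - specificity, sensitivity) pairs, i.e. the points of an ROC curve."""
    n_pos = sum(1 for l in labels if l == positive)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == positive)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

# A model that ranks both 3G customers above both 2G customers: the curve
# rises to full sensitivity before any false positives appear.
pts = roc_points([0.9, 0.8, 0.4, 0.2], ["3G", "3G", "2G", "2G"])
```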
Another criterion we compared our models against was the Capture Response Rate
shown in Table Five. Once again the different Decision Tree models and Ensemble models
clearly outperformed the Regression and ANN models. The larger the space between a model's
curve and the baseline, the greater the Captured Response Rate, indicating the model is
performing better than average. For the Asian Telecommunications company, the goal is to
correctly predict the 3G consumers in the Scoring dataset. Therefore, this graph must be used
with caution, with the reminder that it reflects the classification of both 2G and 3G, not 3G alone.
Table Five: % Captured Response
A third criterion to consider for selecting the correct model is to look at the Lift Chart as
shown in Table Six.
Table 6: Lift Chart Values
When viewing this chart, look for the point at which each individual Model Lift Value
line crosses the threshold of 1. Beyond this point, the model provides no significant lift. In
the case of these models, the Regression and ANN models cross the baseline at a
greater percentile than the Decision Tree and Ensemble models. It is at the client's discretion
whether this is considered important. If the client only targets the top 30th percentile of its
consumers, then the models should be judged at this percentile mark. Our group took this
into consideration but felt this criterion would not greatly influence the decision of which
models to use. Using this criterion as the sole determinant for a successful model would end in
disappointment. When interpreted in conjunction with the Captured Response chart in Table
Five, it becomes clear the Decision Trees are more effective at predicting whether the customer
will become a 3G user.
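The lift values plotted in Table Six come from ranking customers by model score and comparing each bin's response rate to the overall rate; the sketch below uses invented scores and five bins rather than the usual ten, purely for brevity.

```python
def lift_by_bin(scores, labels, positive="3G", bins=5):
    """Sort customers by descending model score, cut the ranking into equal
    bins, and compute each bin's response rate relative to the overall
    response rate (the lift)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    overall = sum(1 for l in labels if l == positive) / len(labels)
    size = len(order) // bins
    lifts = []
    for b in range(bins):
        chunk = order[b * size:(b + 1) * size]
        rate = sum(1 for i in chunk if labels[i] == positive) / size
        lifts.append(rate / overall)
    return lifts

# Toy scores where the 3G customers cluster near the top of the ranking.
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = ["3G", "3G", "2G", "3G", "2G", "2G", "2G", "2G", "2G", "2G"]
lifts = lift_by_bin(scores, labels)
```

Here lift falls below 1 after the second bin, which is the "crossing the threshold of 1" behavior described above.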
As you can see in the classification matrix in Table Seven, the Ensemble model was
quite successful at predicting which category a customer would fall into. It excelled at predicting
the 2G customers and did quite well at classifying the 3G customers. The Ensemble also had a
relatively low rate of labeling as 2G those customers who actually became 3G customers.
Table 7: Classification Matrix
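The evaluation measures used throughout this report follow directly from the cells of such a classification matrix; a minimal sketch with invented labels:

```python
def classification_metrics(actual, predicted, positive="3G"):
    """Sensitivity, specificity, and misclassification rate derived from
    the four cells of a classification (confusion) matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn),          # 3G customers caught
        "specificity": tn / (tn + fp),          # 2G customers correctly kept
        "misclassification": (fp + fn) / len(pairs),
    }

# Toy example: one of the two 3G customers is mislabeled as 2G.
metrics = classification_metrics(["3G", "3G", "2G", "2G"],
                                 ["3G", "2G", "2G", "2G"])
```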
Conclusion
Using the various analysis tools which SAS Enterprise Miner offers, we determined that the
Ensemble Model, consisting of the Entropy Reduction and Gini Reduction Decision Tree models,
performed the best for this particular business dataset and problem. We came to this conclusion by weighing
various criteria such as sensitivity, specificity, misclassification, and captured response. Based
on the management decision making process, each of these criteria can become the most
important determining factor in deciding which model or models to use.
Diagram of Data Flow: