Group Assignment I

Predictive Modeling & Scoring
(PAKDD 2006 Data Mining Competition)

By
ASIAN SQUARE (STL5)
Vinod Kanniah
Han-Mo Oh
Pimtong Tavitiyaman
Pawinee Boonyasopon

Oklahoma State University
Stillwater, Oklahoma 74075
United States of America

Due date: March 1, 2006
Table of Contents

Executive Summary
Business Understanding
Data Understanding
  Data Source
  Data Set
  Missing Value and Miscoding
Data Preparation
  Recoding and Data Type Adjustment
  Missing Values
  Data Adjustment
Modeling
  Data Set Attributes
  Sampling
  Data Partition
  Model Building Process
Evaluation
  Best Model
  Rules in Decision Tree
  Important Variables in the Final Model
  Scoring
Deployment
Conclusion
Appendix
Executive Summary
The sintrain dataset of 2G and 3G customers provided by PAKDD was used to predict customer type and the likelihood of existing 2G customers switching to the 3G network. Given the 3G service's better roaming and data-transfer capabilities, there is a strong marketing opportunity in identifying the 2G customers most likely to shift to the next-generation 3G network.
Our team analyzed the prediction based on the sintrain dataset provided by PAKDD, using SAS Enterprise Miner version 4.3. On initial exploration we observed that many variables were miscoded or carried little information relevant to the desired prediction. Hence, appropriate judgments were made to reject variables showing redundancy, skewness, high correlation, or improper data entry. Missing values in the retained variables were replaced, and the variables were transformed toward a normal distribution.
Five different models were analyzed on three different datasets: the original dataset (217 input variables), the dataset with variables retained after judgments made by our team (141 input variables), and the dataset containing the team-chosen variables after replacement and transformation (141 input variables). A random seed of 2103 and a sample size of 6,000, corresponding to 50% 2G and 50% 3G customers, were used. The dataset was split into training (60%), validation (30%), and testing (10%) partitions with the same random seed. Each dataset was used to run five prediction models: Logistic Regression; Decision Tree; Variable Selection (Chi-Square/R-Square) followed by Artificial Neural Network, Decision Tree, or Logistic Regression; Decision Tree followed by ANN or Logistic Regression; and an Ensemble model.
Three models were evaluated based on accuracy measures such as classification accuracy, sensitivity, and specificity. However, the differences in accuracy measures among these models were insignificant. Thus, we compared the models using lift charts and selected the Decision Tree (C4.5 algorithm) as variable selection followed by an ANN (MLP) as the best model for predicting 3G customers. The accuracy, sensitivity, and specificity of our final model were 80.56%, 74.19%, and 86.90% respectively. We observed that this model would accurately identify 57% of the 3G customers within the first 30% of the cumulative customers in the database, slightly higher than the other two models. The profit over the first 20% of the cumulative customers was also the highest for this model.
The handset model number, average GPRS data utilization (Kb) in the last 6 months, the age of the handset in months, average games utilization in the last 6 months, and the days to contract expiry were the major variables for predicting 3G customers. The manager may consider focusing on these particular customer market segments to improve the customers' likelihood of buying 3G network services. The model and the results are explained in detail in this report.
The suggestions above are based on the business objective, the data analysis, and the testing of different models. The manager should update this model after every campaign and make the changes necessary for future predictions. If the manager believes that variables should be added or removed based on his or her domain expertise, he or she may do so. We strongly recommend that the company manage its data properly in the future and update the models with the data available at that time. Since the results varied with different random seeds, the manager should also revisit the seed when applying the updated model to the population.
Our team followed the CRISP-DM steps to complete this project. The six major steps involved in
it are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and
Deployment.
Diagram 1: CRISP-DM steps
Source: http://www.corporateintellect.com/services/crisp_dm02.htm
1. Business Understanding
The objective of this project was to use customer usage and demographic data to accurately identify the 3G customers in the scoring dataset, based on the training dataset provided by PAKDD, and to predict which customers may switch from the 2G to the 3G network. 2G wireless systems have three system standards (CDMA, GSM, and TDMA) and transfer data at only 19.2 Kbps. In contrast, 3G can transfer data at up to 2 Mbps, support large multimedia files and applications, and offer increased roaming capabilities (Graven, 2006). Thus, the company can use the results of this report to focus its marketing efforts on the right target market and promote its 3G network service with appropriate offers. Moreover, it will help the company compete in a highly competitive market.
The business objectives of this study are to:
1. Build a prediction model to accurately predict the buying intention of 3G customers based on the training dataset.
2. Employ the prediction model on the scoring dataset to identify the 3G customers and to investigate the possibility of existing 2G customers changing to the 3G network service.
3. Examine the behavioral characteristics (variables) of existing customers to improve marketing efforts toward prospective customers.
2. Data Understanding
Our team used SAS Enterprise Miner version 4.3 for data exploration, analysis and modeling.
Data Source
We used the datasets provided by PAKDD. The training dataset had a target variable (CUSTOMER_TYPE) used to build a prediction model and hence identify the target variable in the scoring dataset. The sintrain dataset contained information on existing 2G and 3G customers, such as customers' demographic profiles, promotions, usage behavior, and network usage (2G and 3G).
Data Set
There were two datasets given. We used the sintrain dataset (15K 2G customers, 3K 3G customers) for training, validation, and testing, and the sinscore dataset (5K 2G customers, 1K 3G customers) for scoring. Hence there were 18,000 records for training and 6,000 for scoring. The sintrain dataset had 251 variables: we counted 38 categorical and 213 numeric variables, while SAS identified 79 class and 172 interval variables. The target variable CUSTOMER_TYPE was coded either 2G or 3G. The percentages of 2G and 3G customers in the sintrain dataset were 83.33% and 16.67% respectively.
In order to better understand the data, we looked at the distribution of all the variables and
analyzed some of them by statistical tests for correlation, redundancy and significance based on
their means.
Missing Value and Miscoding
After exploring the data, we found that the dataset had missing values in 10 variables.
Table 1: Variables with Missing Values

Name of Variable              Data Type   % Missing
DAYS_TO_CONTRACT_EXPIRY       Interval    6%
AGE                           Interval    3%
TOT_DEBIT_SHARE               Interval    2%
OCCUP_CD                      Class       63%
MARITAL STATUS                Class       6%
CONTRACT_FLAG                 Class       6%
PAY_METHOD                    Class       5%
PAY_METHOD_PREV               Class       5%
HS_MANUFACTURER               Class       3%
HS_MODEL                      Class       3%
Some variables, however, were not documented properly. For example, the descriptions of AVG_CALL_MOB and AVG_MINS_MOB did not match their variable names. Furthermore, the variables STD_VAS_QG, STD_VAS_QI, STD_VAS_QP, STD_VAS_QTXT, and STD_VAS_QTUNE were misspelled as STD_VAS_FG, STD_VAS_FI, STD_VAS_FP, STD_VAS_FTXT, and STD_VAS_FTUNE in their descriptions. Also, 30 variables were rejected by default in SAS EM version 4.3 because they were extremely skewed (unary: 24 variables, ordinal: 3 variables, binary: 3 variables). Further details on missing values, miscoding, redundancy, and related issues are elaborated in the data preparation step.
3. Data Preparation
Data preparation is a key step in CRISP-DM toward building the best model. After exploring the data, we found that we had to handle outliers, redundant variables, miscoding, binning of numeric variables, grouping of classes for categorical variables, and missing values, applying appropriate transformations where needed.
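The binning step mentioned above can be sketched in a few lines. The report used SAS EM; the following pandas analogue is our assumption of the mechanics, with illustrative data rather than the competition files (appendix 7 later bins HS_AGE into 2 buckets):

```python
import pandas as pd

# Illustrative sketch: bin a numeric variable into two buckets,
# as SAS EM's transformation node does for HS_AGE in appendix 7.
# The values below are made up for demonstration.
hs_age = pd.Series([1, 2, 3, 5, 8, 13, 21, 34], name="HS_AGE")
binned = pd.cut(hs_age, bins=2, labels=["low", "high"])
print(binned.value_counts().to_dict())  # {'low': 6, 'high': 2}
```

Equal-width binning as shown splits the observed range in half; SAS EM supports several bucket definitions, and the exact cut points in the report's model would differ.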
Recoding & Data Type Adjustment
On analyzing the sintrain dataset, we found that the data types did not match those in the variable description table, so some data type adjustments were made. In the training dataset, we observed that 36 categorical variables were miscoded, and we recoded them as binary (20) and nominal (16) variables.
Missing Values
Missing values can seriously affect data mining results, giving invalid and unreasonable conclusions. Hence, we looked at each variable's distribution and decided whether to replace the missing data or reject the variable. For example, OCCUP_CD, a nominal variable, had 63% missing data. Even after applying a replacement technique, it was not retained in the final prediction model, so we rejected OCCUP_CD from the input data source.
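The replace-or-reject decision can be expressed as a small rule. This is a hedged sketch, not the team's SAS EM workflow: the 50% rejection threshold and the median/mode imputation are our assumptions, and the demo frame is synthetic (only the column names echo the report).

```python
import pandas as pd
import numpy as np

def handle_missing(df: pd.DataFrame, reject_threshold: float = 0.5):
    """Drop columns that are mostly missing; impute the rest.

    The 50% threshold is an illustrative assumption, loosely mirroring
    the team's choice to reject OCCUP_CD (63% missing)."""
    kept = df.copy()
    rejected = []
    for col in df.columns:
        frac = df[col].isna().mean()
        if frac > reject_threshold:
            kept = kept.drop(columns=col)   # too sparse to impute reliably
            rejected.append(col)
        elif frac > 0:
            if pd.api.types.is_numeric_dtype(df[col]):
                kept[col] = df[col].fillna(df[col].median())
            else:
                kept[col] = df[col].fillna(df[col].mode().iloc[0])
    return kept, rejected

# Synthetic demo data (column names from the report, values invented).
demo = pd.DataFrame({
    "AGE": [25, np.nan, 40, 31],                  # 25% missing -> impute
    "OCCUP_CD": [np.nan, np.nan, np.nan, "ENG"],  # 75% missing -> reject
})
clean, dropped = handle_missing(demo)
print(dropped)  # ['OCCUP_CD']
```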
Data Adjustment
In the beginning there were 249 variables in the sintrain dataset. However, we rejected many variables based on the factors described below; the rejected variables are tabulated in appendix 1. We retained variables by analyzing their importance in predicting the target variable. Our group finally rejected 109 variables and retained 141 variables in the input data source to run the different models.
There were six reasons analyzed by our team for rejecting certain variables.
1). Rejected by default in SAS EM version 4.3 because of extreme skewness
There were 24 unary, 3 ordinal, and 3 binary variables rejected by default in SAS EM version 4.3. Extremely skewed variables show almost no variation across the data and hence have no effect on predicting the target variable.
2). Rejected based on t-test
SPSS version 14.0 was used to test the statistical significance of variables with t-tests. 19 variables were rejected because their group means showed no significant difference between the 2G and 3G groups (p-value greater than 0.05).
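The same screen can be sketched outside SPSS. This is an illustrative analogue on synthetic data (variable names and distributions invented): a variable is kept only when the 2G and 3G group means differ significantly.

```python
import numpy as np
from scipy import stats

# Synthetic two-group data: AVG_MINS has a real mean difference,
# NOISE is drawn identically for both groups.
rng = np.random.default_rng(2103)
g2 = {"AVG_MINS": rng.normal(100, 20, 500), "NOISE": rng.normal(0, 1, 500)}
g3 = {"AVG_MINS": rng.normal(130, 20, 500), "NOISE": rng.normal(0, 1, 500)}

kept, rejected = [], []
for var in g2:
    # Welch's t-test; reject the variable when p > 0.05 (no evidence
    # the 2G and 3G means differ), mirroring the rule in the text.
    t_stat, p_value = stats.ttest_ind(g2[var], g3[var], equal_var=False)
    (kept if p_value < 0.05 else rejected).append(var)
print(kept, rejected)
```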
3). Rejected based on skewness (our team decision)
Our group decided to reject 8 binary variables with skewness over 98%. After repeated trials, we found that these variables were not included in any of the prediction models built, and hence we rejected them.
4). Rejected considering minutes a better measure than calls (our team decision)
We observed redundant variables describing the same behavior. Since the number of calls is positively related to minutes, we decided to use just one type of variable for prediction; for example, we used AVG_MINS as an input variable while rejecting AVG_CALL. We also rejected the standard deviations of the call variables. In total, we rejected 36 variables under this category.
5). Rejected because of redundancy in calculation and measurement
Some variables were rejected as redundant. For example, REVPAY_PREV_CD was simply a coded version of AVG_REVPAY_AMT, so we retained only AVG_REVPAY_AMT for prediction. Likewise, we used the average variables AVG_DIS_1900 and AVG_USAGE_DAYS for prediction instead of the totals TOT_DIS_1900 and TOT_USAGE_DAYS. Some variables were sums of other variables (e.g., A = B + C); we ran different prediction models and concluded that the sum A was significant enough on its own to predict the target variable. An example is AVG_BILL_VOICE, which is the sum of AVG_BILL_VOICED and AVG_BILL_VOICEI. The same reasoning applied to the standard deviations of these variables.
6). Rejected because of improper data entry
AVG_CALL_MOB was described as the average number of mobile minutes in the last six months and AVG_MINS_MOB as the average number of mobile calls in the last six months, the opposite of what the names suggest. We observed errors in the data entry for these variables and hence rejected them. STD_CALL_MOB and STD_MINS_MOB were rejected for the same reason.
4. Modeling
Our team used SAS Enterprise Miner version 4.3 to build prediction models.
Data set attributes
We specified CUSTOMER_TYPE as the target variable. In addition to the automatic attribute settings made by SAS, we set the roles of variables based on their importance in prediction, as explained in the data exploration. Thus we included 141 input variables in the prediction models.
Sampling
The stratified method of sampling was used and the variable CUSTOMER_TYPE was set to use
in the status tab. A random seed of 2103 and a sample size of 6000 corresponding to 50% 2G and
50% 3G were used. We could have used any number less than 6000. But since only 3000 records
were available for 3G, we used the entire 3000 records in the sample data set for prediction.
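The mechanics of this stratified draw can be sketched as follows. This is our assumption of what SAS EM's sampling node does, shown with pandas on a scaled-down toy frame (the 6,000-record, 50/50 draw of the report would use the same logic):

```python
import pandas as pd

def stratified_sample(df, target="CUSTOMER_TYPE", n=6000, seed=2103):
    """Draw an equal number of records from each target class,
    then shuffle. A rough analogue of stratified sampling in SAS EM."""
    per_class = n // df[target].nunique()
    parts = [
        grp.sample(n=min(per_class, len(grp)), random_state=seed)
        for _, grp in df.groupby(target)
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# Toy data mirroring the 15K/3K imbalance at 1/1000 scale.
toy = pd.DataFrame({"CUSTOMER_TYPE": ["2G"] * 15 + ["3G"] * 3})
sample = stratified_sample(toy, n=6)
print(dict(sorted(sample["CUSTOMER_TYPE"].value_counts().items())))
# {'2G': 3, '3G': 3}
```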
Data partition
In order to build the model, the dataset was split into training (60%), validation (30%), and testing (10%) proportions using a random seed of 2103.
We tried different combinations of data partition (70:20:10, 60:30:10, 50:40:10) for the data split.
We also tried different random seeds and finally concluded after trying different models that seed
2103 and partition of 60:30:10 were the best among the random seeds and percent partition
chosen.
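A seeded 60/30/10 split like the one above can be sketched in a few lines. This NumPy version is an analogue of SAS EM's data partition node (our assumption of the mechanics); the exact record assignments would differ from SAS even with the same seed.

```python
import numpy as np

def partition(n, seed=2103, fracs=(0.6, 0.3, 0.1)):
    """Shuffle record indices with a fixed seed and cut them into
    training, validation, and test index arrays."""
    idx = np.random.default_rng(seed).permutation(n)
    cut1 = int(fracs[0] * n)
    cut2 = cut1 + int(fracs[1] * n)
    return idx[:cut1], idx[cut1:cut2], idx[cut2:]

train, valid, test = partition(6000)
print(len(train), len(valid), len(test))  # 3600 1800 600
```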
Model Building Process
We started the model building process with the default data set given by PAKDD. The tree
describing the model building process adopted by our team to come up with the final model is
shown in appendix 2.
1. Default models: 217 input variables, 5 prediction models
2. Models after manually rejecting variables: 141 input variables, 5 prediction models
3. Models after replacements and transformations: 141 input variables, 5 prediction models
Our team used five data mining techniques for prediction:
• Logistic Regression (LR)
• Decision Tree (DT)
  a) CHAID algorithm (Chi-square, alpha = 0.05)
  b) C4.5 algorithm (entropy reduction)
  c) CART algorithm (Gini reduction)
• Variable Selection (Decision Tree) followed by
  a) Logistic Regression
  b) Artificial Neural Network (MLP)
• Variable Selection (Chi-square/R-square) followed by
  a) Logistic Regression
  b) Decision Tree
  c) Artificial Neural Network (MLP)
• Ensemble Model (a combination of the above 4)
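The eventual winning combination, a decision tree used as a variable-selection step feeding a neural network, has no direct open-source equivalent to the SAS EM nodes, but a rough scikit-learn analogue can be sketched. Everything here is our assumption: the synthetic data, the selection threshold, and the solver; only the entropy criterion and the single hidden neuron follow the report.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8,
                           random_state=2103)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=2103)

model = make_pipeline(
    # Entropy-based decision tree used only to rank and select variables.
    SelectFromModel(DecisionTreeClassifier(criterion="entropy",
                                           random_state=2103)),
    StandardScaler(),
    # MLP with 1 hidden layer and 1 hidden neuron, as in the report.
    MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                  max_iter=2000, random_state=2103),
)
model.fit(X_tr, y_tr)
print(f"holdout accuracy: {model.score(X_te, y_te):.3f}")
```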
We built all five prediction models as above and examined their misclassification rates, accuracy, sensitivity, and specificity. These measures for the 5 models on the default dataset are shown in appendix 4. We then used the dataset with the variables retained after rejecting 109 variables based on the team's decisions, and observed that prediction accuracy, sensitivity, and specificity improved; the results are shown in appendix 5. Finally, we ran the prediction models with appropriate missing value replacement techniques and transformations. The transformations of variables in the best model are shown in appendix 7. We again observed an increase in accuracy, specificity, and sensitivity, as shown in appendix 6. The node diagram for building this model is shown in appendix 3. We then selected 3 models based on mean accuracy, sensitivity, and specificity.
5. Evaluation
In order to evaluate the best models, we analyzed the misclassification rates on the test dataset and the lift charts of the models. The models were evaluated based on accuracy measures such as classification accuracy, sensitivity, and specificity. The results were obtained using cross-validation for each model and are based on the validation dataset. Once the confusion matrices were constructed, the accuracy, sensitivity, and specificity of each model were calculated using the respective formulas:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
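The three formulas translate directly into a small helper. The confusion-matrix counts below are illustrative (chosen to land near the final model's reported measures), not the report's actual counts.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from a
    confusion matrix, exactly as in the formulas above."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate (3G customers)
    specificity = tn / (tn + fp)   # true-negative rate (2G customers)
    return accuracy, sensitivity, specificity

# Illustrative counts only; the real validation counts are not given.
acc, sens, spec = metrics(tp=74, tn=87, fp=13, fn=26)
print(acc, sens, spec)  # 0.805 0.74 0.87
```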
Best Model
Three models were chosen based on mean accuracy, mean sensitivity, and mean specificity:
C4.5: Decision Tree (C4.5 entropy algorithm)
MLP by s: Variable selection by Chi-square followed by the Artificial Neural Network algorithm
MLP by C: Variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm
For the three best model types, the detailed prediction results of validation datasets are presented
below.
Table 2: Comparison of prediction results on the validation dataset

Measures on    DT (C4.5)   DT (C4.5) variable selection   Chi-square variable selection
validation                 followed by ANN (MLP)          followed by ANN (MLP)
Accuracy       0.8117      0.8056                         0.8050
Sensitivity    0.7898      0.7419                         0.8231
Specificity    0.8355      0.8690                         0.7869
We found that the ANN model with variable selection by the C4.5 algorithm achieved a classification accuracy of 0.8056, with a sensitivity of 0.7419 and a specificity of 0.8690. The ANN model with variable selection by χ2 achieved a classification accuracy of 0.8050, with a sensitivity of 0.8231 and a specificity of 0.7869. The decision tree (C4.5) achieved the best classification accuracy of the three models evaluated: 0.8117, with a sensitivity of 0.7898 and a specificity of 0.8355.
However, the differences in accuracy measures among these models are insignificant. Thus, we compared the three models using lift charts, shown in appendix 8. We observed that the first 30% of cumulative customers from the MLP by C model captured almost 57% of the 3G customers, slightly more than the other two models. Also, over the first 20% of cumulative customers, the profit was highest for the MLP by C model. From the non-cumulative lift chart of the MLP by C model, beyond the 47th percentile of predicted probability the proportion of 3G customers is equal to or lower than what we would expect from a random sample. Thus we conclude that the DT (C4.5) variable selection followed by ANN (MLP) model is the best model for predicting the 3G customers.
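The "% captured at 30% depth" reading from the lift chart can be computed directly: sort customers by predicted 3G probability, take the top 30%, and measure what share of all true 3G customers falls in that slice. The scores below are synthetic; only the mechanics mirror the chart.

```python
import numpy as np

def captured_response(y_true, scores, depth=0.3):
    """Fraction of all positives found in the top `depth` fraction
    of customers when ranked by score, i.e. one cumulative-lift point."""
    order = np.argsort(scores)[::-1]              # highest scores first
    top = order[: int(depth * len(scores))]
    return y_true[top].sum() / y_true.sum()

# Synthetic balanced sample with a mildly informative score.
rng = np.random.default_rng(2103)
y = (rng.random(1000) < 0.5).astype(int)
scores = np.where(y, rng.normal(0.6, 0.2, 1000), rng.normal(0.4, 0.2, 1000))
print(f"{captured_response(y, scores):.0%} of 3G captured in top 30%")
```

A random ranking would capture 30% at depth 30%; the report's best model captured about 57% at that depth.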
Rules in Decision Tree
In order to select variables, we built a decision tree using the C4.5 algorithm, which splits the input observations into two or more subgroups; this step is repeated at each leaf node until the complete tree is constructed. The root node was split on HS_MODEL, producing the first level: 845 samples with handset model values such as 11240, 10829, 10680, etc. were grouped into node 1, while 2755 samples with the other HS_MODEL values were grouped into node 2. Node 2 was then split on the handset age in months (HS_AGE), yielding the second level of the tree: 677 samples with HS_AGE smaller than 3.5 went to node 3, and 2978 samples with age values greater than or equal to 3.5 went to node 4. Node 3 was split on DAYS_TO_CONTRACT_EXPIRY: 471 samples with fewer than 6.5 days to contract expiry went to node 5, and 206 samples with 6.5 or more days went to node 6. Based on the current subscription plan type (SUBPLAN), node 4 was split into two parts: 518 samples with values such as 2102, 2136, 2248, etc. in node 7, and 1560 samples with the other SUBPLAN values in node 8. Node 6 was split on HS_MODEL, giving 104 samples in node 9 and 102 samples with HS_MODEL values such as 10828, 10809, 10660, 11234, etc. in node 10. Node 7 was likewise split on HS_MODEL into nodes 11 and 12. Node 10 was split into nodes 13 and 14 on the average games utilization (Kb) in the last 6 months (AVG_VAS_GAMES), with a splitting criterion of 932203. Node 11 was split into nodes 15 and 16 based on the criterion 150198.25 of the average GPRS data utilization (Kb) in the last 6 months (AVG_VAS_GPRS). Finally, nodes 17 and 19 were created from node 15 by the previous subscription plan type.
Accordingly, seven variables were selected: HS_MODEL, HS_AGE, SUBPLAN, DAYS_TO_CONTRACT_EXPIRY, SUBPLAN_PREVIOUS, AVG_VAS_GPRS, and AVG_VAS_GAMES.
Important Variables in the Final Model
After selecting variables with the C4.5 algorithm, we created an ANN model with a multilayer perceptron (MLP), run with 1 hidden layer and 1 hidden neuron. The ANN model with C4.5-based variable selection provided information about the relative importance of the input variables in predicting customer type, ranking them accordingly. As shown in appendix 9, the variable HS_MODEL_10214 was by far the most important variable in the prediction of customer type. The second most important variable was AVG_MIS0001, transformed from AVG_VAS_GPRS. The third most important was HS_A_70W0001:low, transformed from HS_AGE. In all, 11 variables were critical to the prediction of customer type: HS_MODEL 10328, HS_MODEL 10326, HS_MODEL 10324, HS_MODEL 10316, HS_MODEL 10311, HS_MODEL 10214, HS_MODEL 10211, HS_A_70W0001:low, AVG_MIS0001:low, AVG_60QC0001:low (transformed from AVG_VAS_GAMES), and DAYS_TO_CONTRACT_EXPIRY.
Scoring
The best model (DT (C4.5) variable selection followed by ANN (MLP)) was connected to the scoring dataset provided by PAKDD. Using the scoring node in SAS EM version 4.3, we determined the percentages of 2G and 3G customers to be 77.32% and 22.68% respectively.
Thus 22.68% of the scoring dataset would be 3G customer types.
A tab-delimited file named STL5.txt containing the scored dataset is attached with this report.
6. Deployment
For the business implication: if the company plans to spend money on a marketing campaign, it may consider using this model. In the training dataset there were 83.33% 2G customers and 16.67% 3G customers; our best model classified the scored dataset as 77.32% 2G and 22.68% 3G customers. The manager should update this model after every campaign and make the changes necessary for future predictions. If the manager believes that variables should be added or removed based on his or her domain expertise, he or she may do so.
The best model was cross-checked on the sintrain dataset to verify its sensitivity and accuracy on the original 18,000 records. The sensitivity and accuracy of our model on the sintrain dataset were 79.5% and 84.05% respectively.
Financial terms
From the 18,000-customer dataset we know there are exactly 3,000 3G customers. Assume the cost of advertising is $1 per mailing and we make $10 on each response.

Mailing to everyone:
Total cost: $1 x 18,000 = $18,000
Revenue: $10 x 3,000 = $30,000
Profit: $12,000

Performing the marketing campaign based on our best model:
Total cost: $1 x 4,640 = $4,640
Revenue: $10 x 2,385 = $23,850
Profit: $19,210

Hence the company makes an additional profit of $7,210. Expenses decrease when the model correctly identifies the customer market segments.
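The break-even arithmetic above can be restated as a two-line calculation, useful for re-running with the costs and response counts of a future campaign:

```python
def campaign_profit(mailed, responders, cost=1.0, revenue=10.0):
    """Profit of a mailing campaign: revenue per response minus
    cost per piece mailed (figures from the report's example)."""
    return revenue * responders - cost * mailed

mass_mailing = campaign_profit(18_000, 3_000)  # contact everyone
model_based = campaign_profit(4_640, 2_385)    # contact model-ranked subset
print(mass_mailing, model_based, model_based - mass_mailing)
# 12000.0 19210.0 7210.0
```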
Thus, in order to deploy data mining results in the business, the company has to monitor and maintain its data properly. If there is a change in technology or network, the company should build a new, suitable prediction model. This will help the company recognize the trend at that time and apply the right strategy to compete in the market.
Conclusion
A prediction model was built on the sintrain dataset provided by PAKDD to predict the 3G customer types. A decision tree with the C4.5 algorithm as a variable selection node, followed by an Artificial Neural Network (multilayer perceptron), was the best model based on accuracy, sensitivity, specificity, and lift charts. The handset model numbers, average GPRS data utilization (Kb) in the last 6 months, the age of the handset in months, average games utilization in the last 6 months, and the days to contract expiry were the major variables for predicting 3G customers. The manager may consider focusing on these particular customer market segments to improve the customers' likelihood of buying 3G network services.
In the holdout (sinscore) dataset, 22.68% of customers were predicted to be 3G. The tab-delimited file of the holdout sample, named STL5.txt, is attached with this report.
APPENDIX
Appendix 1

Rejected Variables

1. Rejected by SAS default (unary and ordinal): 30 variables
HS_CHANGE, TELE_CHANGE_FLAG, TOT_PAST_DEMAND, VAS_VMN_FLAG, VAS_VMP_FLAG, VAS_INFOSRV, VAS_SN_FLAG, VAS_CSMS_FLAG, DELINQ_FREQ, AVG_VAS_QI, AVG_VAS_QP, AVG_VAS_QTXT, AVG_VAS_IDU, AVG_VAS_WLINK, AVG_VAS_MILL, AVG_VAS_IFSMS, AVG_VAS_123_, AVG_VAS_CG, AVG_VAS_IEM, AVG_VAS_ISMS, AVG_VAS_SS, STD_VAS_QI, STD_VAS_QP, STD_VAS_QTXT, STD_VAS_IDU, STD_VAS_WLINK, STD_VAS_MILL, STD_VAS_CG, STD_VAS_IFSMS, STD_VAS_123_, STD_VAS_IEM, STD_VAS_ISMS, STD_VAS_SS

2. Rejected based on t-test: 19 variables
TOT_PAST_TOS, TOT_TOT_DAYS, OD_REL_SIZE, REVPAY_FREQ, AVG_VAS_QG, AVG_VAS_QTUNE, AVG_VAS_WAP, AVG_VAS_ESMS, AVG_VAS_GPSMS, AVG_VAS_YMSMS, AVG_EXTRAN_RATIO, AVG_SPHERE, STD_VAS_QG, STD_VAS_CWAP, STD_VAS_ESMS, STD_VAS_GPSMS, STD_VAS_YMSMS, STD_CALL_FRW_RATIO, STD_BUCKET_UTIL

3. Rejected based on skewness of binary variables (> 98%), our team decision: 8 variables
ID_CHANGE_FLAG, VAS_CND_FLAG, VAS_CNND_FLAG, VAS_DRIVE_FLAG, VAS_FF_FLAG, VAS_IB_FLAG, VAS_NR_FLAG, VAS_IEM_FLAG

4. Rejected calls (and their standard deviations) assuming minutes is a better measure than calls, our team decision: 36 variables
AVG_CALL, AVG_MINS_OB, AVG_CALL_OB, AVG_MINS_IB, AVG_CALL_IB, AVG_MINS_PK, AVG_CALL_PK, AVG_MINS_OP, AVG_CALL_OP, AVG_CALL_FIX, AVG_CALL_INTRAN, AVG_CALL_EXTRAN, AVG_CALL_LOCAL, AVG_CALL_T1, AVG_CALL_EXTRANTS, AVG_CALL_INT, AVG_CALL_FRW, AVG_CALL_1900, STD_CALL, STD_MINS_OB, STD_CALL_OB, STD_MINS_IB, STD_CALL_IB, STD_MINS_PK, STD_CALL_PK, STD_MINS_OP, STD_CALL_OP, STD_CALL_FIX, STD_CALL_INTRAN, STD_CALL_EXTRAN, STD_CALL_LOCAL, STD_CALL_T1, STD_CALL_EXTRANTS, STD_CALL_INT, STD_CALL_FRW, STD_CALL_1900

5. Rejected because of redundancy in calculation: 9 variables
REVPAY_PREV_CD, TOT_DIS_1900, TOT_USAGE_DAYS, AVG_BILL_VOICED, AVG_BILL_VOICEI, AVG_VAS_ARC, STD_BILL_VOICED, STD_BILL_VOICEI, STD_VAS_ARC

6. Rejected because of improper data entry: 4 variables
AVG_CALL_MOB, AVG_MINS_MOB, STD_CALL_MOB, STD_MINS_MOB
Appendix 2: Model Building Tree

[Diagram: the model building tree has three branches: Default Models; Models after rejecting variables; and Models after rejecting, replacing, and transforming variables. Each branch runs LR; DT (CHAID, CART, C4.5); VS (DT) followed by LR or MLP; VS (χ2) followed by LR or MLP; VS (R2) followed by LR or MLP; and EM.]
Appendix 3

[Figure: final node diagram for the models built with appropriate replacements and transformations.]
Appendix 4

Default Models

[Table: misclassification rate, accuracy, sensitivity, and specificity on the train, validation, and test partitions for the 13 default models: LR; DT (CHAID, C4.5, CART); DT variable selection followed by LR or ANN (MLP); χ2 variable selection followed by LR, ANN (MLP), or DT (CHAID); R2 variable selection followed by LR, ANN (MLP), or DT (CHAID); and EM.]
Appendix 5

Models after Rejecting Variables Based on the Rules Set Up by Our Team

[Table: misclassification rate, accuracy, sensitivity, and specificity on the train, validation, and test partitions for the 13 models built on the 141 retained variables: LR; DT (CHAID, C4.5, CART); DT (C4.5) variable selection followed by LR, ANN (MLP), or DT (C4.5); χ2 variable selection followed by LR, ANN (MLP), or DT (C4.5); R2 variable selection; and EM.]
Appendix 6

Models after Rejecting Variables Based on the Rules, Performing Missing Value Replacements, and Appropriate Transformations

[Table: misclassification rate, accuracy, sensitivity, and specificity on the train, validation, and test partitions for the 13 models. The validation figures for the three best models are summarized in Table 2: DT (C4.5) accuracy 0.8117; DT (C4.5) variable selection followed by ANN (MLP) accuracy 0.8056, sensitivity 0.7419, specificity 0.8690; χ2 variable selection followed by ANN (MLP) accuracy 0.8050.]
PAKDD 2006
OSU-Group 5
Appendix 7

Transformation of variables in the final best model (variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm)

Variable name               Transformation used
HS_AGE                      Binning (buckets = 2)
AVG_VAS_GAMES               Binning (buckets = 2)
AVG_VAS_GPRS                No effective transformation could be done
DAYS_TO_CONTRACT_EXPIRY     No effective transformation could be done

[Before/after distribution charts omitted.]
Appendix 8

Lift Charts

C4.5: Decision Tree (C4.5 entropy algorithm)
MLP by s: Variable selection by Chi-square followed by the Artificial Neural Network algorithm
MLP by C: Variable selection by Decision Tree (C4.5 entropy algorithm) followed by the Artificial Neural Network algorithm

Lift chart of % captured response on validation data (cumulative)
Lift chart of lift value on validation data (non-cumulative)
Lift chart of profit on validation data (cumulative)
Appendix 9
Results from the best model: DT (C4.5) variable selection followed by ANN (MLP)