Super Computer Data Mining Project Entry for the 2006 PAKDD Data
Mining Competition: Heterogeneous Classifier Ensemble Approach
A. J. Bagnall(1), G. Cawley(1) and L. Bull(2)
(1) School of Computing Sciences, University of East Anglia, Norwich, UK
(2) School of Computer Science, University of the West of England, Bristol, UK
1. Introduction
This document describes the entry of the Super Computer Data Mining (SCDM)
Project to the PAKDD 2006 data mining competition. The SCDM project is
sponsored by the Engineering and Physical Sciences Research Council of the UK
government (funded under grants GR/T18479/01, GR/T18455/01 and
GR/T/18462/01) and began in January 2005. The objective of the project is to
develop data mining tools, implemented in C++ and MPI, for parallel execution on
super computer architecture. The main super computer facility we use is based at the
University of Manchester and is called CSAR. It forms part of the UK high
performance computing (HPC) service. However, the SCDM code will run on any
cluster and will be freely available to the academic community. The SCDM toolkit
has already contributed to several research projects with six papers published
[1,2,3,4,5,6] and more in preparation. More details can be found at the project website
(still under development) [7]. Our motivation for this project is to develop tools that
can perform complex analysis of very large data with many attributes. We have
assembled several data sets with many attributes and many cases to form a standard test
suite for the assessment of data mining algorithms on this type of data [8]. The main
algorithmic focus of the project is on ensemble techniques, with a particular emphasis
on attribute selection. This competition has been particularly useful for us for several
reasons:
1. The PAKDD competition data set will make a useful addition to the data
collection.
2. It provides a test bed for our implementations of existing algorithms (for this
work k-NN, C4.5, Naïve Bayes, Neural Network and Logistic Regression).
3. It allows us to assess new variations of classifiers (Learning Classifier Systems)
and ensemble algorithms (FASBIR).
2. Approach and Understanding of the Problem
The 2006 PAKDD data mining competition involves building classifiers to predict
whether customers will choose a 3G or 2G phone. Our approach to this type of
problem is normally to adopt a data mining methodology such as the Cross Industry
Standard Process for Data Mining (CRISP-DM) [9]. Our ability to gain a good business
understanding through interaction with the customer is obviously limited due to the
nature of the competition. Nevertheless, a structured approach is always beneficial.
Business Understanding
Our only source of business understanding other than the data itself comes from the
competition website.
“An Asian telco operator which has successfully launched a third generation (3G)
mobile telecommunications network would like to make use of existing customer
usage and demographic data to identify which customers are likely to switch to using
their 3G network.”
There is also some indication of the preferences of the customer in terms of the
classification loss function.
“The data mining task is a classification problem for which the objective is to
accurately predict as many current 3G customers as possible (i.e. true positives) from
the “holdout” sample provided.”
This indicates that the objective may be to target marketing towards potential
customers in a way where the cost per individual is small, hence the priority is to not
miss those most likely to want 3G. It is of course trivial to maximize the true
positives by simply predicting everyone as 3G. However, such a solution is obviously
of no interest. A proper modeling of the situation would involve a cost function for
those both interested and not interested in 3G. The choice of cost function will
obviously affect the ranking of our final entry. Given the stated assessment criteria
“Entries will be assessed in terms of the number of 3G customers that are correctly
classified by their model. A ranking table of the entries will be made available on
this website in April 2006.”
we assume a low cost for false positives and high profit for true positives. Since the
true costs are unknown we also present several different cost scenarios in the analysis
section, based on the training data.
Data Understanding
There are 18000 cases in the training data and 6000 in the test data; each case is an
individual phone user. Of the 250 attributes, 37 are categorical and 213 are numeric.
All apart from the basic demographics (sex, age, nationality and marital status) relate
to phone use. Of the usage attributes, there are average and standard deviations for 90
features. Many of these features are obviously related and correlated (for example,
Average number of off-peak minutes and Average number of off-peak calls), and
domain knowledge could be usefully employed to derive features (such as heavy user
or international traveller). However, it is dangerous and probably fruitless to do so
without consultation with domain experts. Hence we only use automated attribute
transformations.
Data Preparation
The data requires significant pre-processing:
Missing values: In addition to those indicated in the data dictionary, there are several
missing values recorded as "=#VALUE!", suggesting a failed calculation in earlier
preprocessing. We have replaced these with missing value indicators. A further 21
attributes are all zeros; this should also be investigated, and we have removed these
attributes completely.
Mismatched attribute values: several attribute values appear only in the training data
or only in the test data. For all attributes we
- group values that occur in the training data but not the test data as "other", and
- recode values that occur in the test data but not the training data as "missing",
as sketched below.
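This recoding can be expressed as a small pandas sketch (a minimal illustration under our own assumptions; the SCDM toolkit itself is written in C++/MPI, so the function below is hypothetical):

    import pandas as pd

    def align_categories(train: pd.Series, test: pd.Series):
        # A minimal sketch of the recoding described above; the pandas
        # workflow is our assumption, not the SCDM implementation.
        train_levels = set(train.dropna().unique())
        test_levels = set(test.dropna().unique())
        # Levels in the training data but not the test data become "other".
        train = train.where(~train.isin(train_levels - test_levels), "other")
        # Levels in the test data but not the training data become "missing".
        test = test.where(~test.isin(test_levels - train_levels), "missing")
        return train, test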
Clustering discrete attributes with a large number of values: many of the categorical
variables have attribute values with very few observations. We have clustered these
together as dictated by the data; a full description is given in Appendix A. We also
provide class distributions for some of the attributes to give an informal idea of their
discriminatory power. The most important of these are subplan and handset: in the
given format these attributes have too many possible values to be of much use, but
after grouping they prove highly indicative.
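A frequency-threshold version of this clustering might look as follows; the threshold is purely illustrative, since our actual groupings were chosen by inspecting the data (Appendix A):

    import pandas as pd

    def group_rare_levels(column: pd.Series, min_count: int = 30) -> pd.Series:
        # Merge levels with fewer than min_count observations into one
        # class. The cut-off of 30 is an assumption for illustration only.
        counts = column.value_counts()
        rare = counts[counts < min_count].index
        return column.where(~column.isin(rare), "rare")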
Transformation of continuous attributes: many of the continuous attributes have
highly skewed distributions and there is a high degree of multi-collinearity. Given the
lack of domain knowledge, we have taken three approaches to the continuous fields to
reflect these characteristics.
1. Leave them as they are (after the preprocessing described in Appendix A): file
Mixed.csv (Version4_1.csv) contains all the formatted discrete attributes and the
continuous fields as provided.
2. Discretise the continuous attributes with the MDL method: file AllDiscrete.csv
(Version4_2.csv) contains only discrete attributes.
3. Transform into principal components, retaining only the components that explain
95% of the variation in the data (see the sketch below): file MixedPCA.csv
(Version4_4.csv) contains the formatted discrete attributes and the 67 principal
components.
Furthermore, we have derived binary dummy attributes for the discrete fields and thus
created two new files for algorithms that are best suited to problems with only
continuous attributes. These are AllContinuous.csv (Version4_3.csv) and
AllContinuousPCA.csv (Version4_5.csv).
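For approach 3, the PCA step can be sketched with scikit-learn (an assumption for illustration; the SCDM toolkit is a C++/MPI implementation, and whether a class label must first be dropped from the file is not specified here):

    import pandas as pd
    from sklearn.decomposition import PCA

    # Keep the smallest number of principal components that together
    # explain 95% of the variance, as in approach 3 above.
    X = pd.read_csv("AllContinuous.csv")   # assumed to hold predictors only
    pca = PCA(n_components=0.95, svd_solver="full")
    components = pca.fit_transform(X)
    print(components.shape[1])             # the paper reports 67 components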
3. Modelling phase: details of the classification model that was produced
Our approach is to produce a probability estimate using an ensemble of classifiers,
some of which may themselves be ensembles. The core classifiers used are:
1. Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR)
[10] is an ensemble k nearest neighbour algorithm that filters attributes by information
gain and injects randomness through the choice of distance metrics and attribute
subsets. We use the parameter values and implementation of FASBIR described in [6];
the final output of FASBIR is the result of an ensemble of 100 alternative k-NN
classifiers (a loose approximation is sketched after this list). FASBIR is run on
AllContinuous.csv and AllContinuousPCA.csv.
2. C4.5 decision tree. Our C4.5 is a standard implementation comparable to the
WEKA version. Based on past experience, we set the minimum leaf node size of C4.5
to 50; this is a crude but effective way of avoiding overfitting. C4.5 is run on
Mixed.csv and AllDiscrete.csv.
3. Naïve Bayes. A standard implementation assuming normality for real valued
attributes. NB is run on AllDiscrete.csv and MixedPCA.csv.
4. Logistic regression. Parameters are estimated with a standard maximum likelihood
technique. It is run on AllContinuous.csv and AllContinuousPCA.csv.
5. Neural network. The NN has a single hidden layer, initially containing 32 or 64
neurons. The output layer uses a softmax activation function with a cross-entropy
error metric and a standard 1-of-c encoding. A Bayesian regularisation scheme with a
Laplace prior is used to avoid overfitting. It is run on AllContinuous.csv and
AllContinuousPCA.csv.
6. Learning Classifier System. The LCS is as described in [5], using an error weighted
fitness function, a niched genetic algorithm and a Q-Learning style Widrow-Hoff
update rule. It is run on AllDiscrete.csv.
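As a rough illustration of the FASBIR idea in item 1 (the full algorithm, with information-gain filtering and perturbed distance metrics, is described in [6,10]), a bagged k-NN over random attribute subsets can be sketched with scikit-learn; every parameter value below except the ensemble size of 100 is our assumption:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Bagged k-NN with random attribute subspaces: a loose stand-in for
    # FASBIR, which additionally filters attributes by information gain
    # and injects randomness into the distance metric [6,10].
    fasbir_like = BaggingClassifier(
        KNeighborsClassifier(n_neighbors=5),  # k is an assumption
        n_estimators=100,  # the paper's FASBIR ensemble has 100 k-NN members
        max_features=0.5,  # attribute subspace fraction is an assumption
        bootstrap=True,
    )
    # Usage: fasbir_like.fit(X_train, y_train); fasbir_like.predict_proba(X_test)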
This gives us 11 classifiers in the meta-ensemble. Each of the 11 classifiers produces
a probability estimate for the training cases, and we use a weighted mean of these
probability estimates as our final prediction. The weights are determined by
estimating the accuracy of each classifier under 10-fold cross-validation. Once a
probability estimate is obtained for each case, a prediction is made using the
following profit matrix; a sketch of the resulting decision rule follows the matrix.
Profit Matrix

                          Actual 3G (true)   Actual 2G (false)
Predicted 3G (positive)   45 (TP)            -5 (FP)
Predicted 2G (negative)   0 (FN)             0 (TN)
This is designed to simulate the idea of making marketing decisions based on the
classifier: the cost of classifying a customer as 3G is associated with the cost of a mail
shot. If the individual is classified as 3G (positive) we assume a mail shot costing 5
units is sent, and if not, no action is taken (at zero cost). We assume that if the mail
shot reaches a 3G customer they will respond (a true positive), yielding a profit of 50
and hence a net return of 45, whereas if they do not respond (a false positive) there is
a loss of 5. Thus, to decide whether to classify a case as 3G, we take the action that
maximizes the expected return. Of course, this may not be the intended use of the
analysis, but a simple decision-theoretic framework at least provides some structure to
justify the classifications we make.
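Combining the weighted mean with this profit matrix, the decision rule reduces to a probability threshold: predict 3G whenever the expected return of mailing, 45p - 5(1-p), exceeds the zero return of doing nothing, i.e. whenever p > 0.1. A sketch (the array names and shapes are hypothetical):

    import numpy as np

    def predict_3g(probs: np.ndarray, cv_acc: np.ndarray) -> np.ndarray:
        # probs: (n_cases, 11) array of P(3G) estimates, one column per
        # classifier; cv_acc: (11,) 10-fold cross-validation accuracies.
        weights = cv_acc / cv_acc.sum()   # accuracy-weighted ensemble
        p = probs @ weights               # weighted mean P(3G)
        # Mail (predict 3G) when the expected profit beats doing nothing:
        # 45*p - 5*(1-p) > 0  <=>  p > 0.1.
        return (45 * p - 5 * (1 - p)) > 0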
Training Results
The training data results indicate the resulting classifier is effective, although it may
be based on information that is not of great use in predicting whether individuals will
switch. Predicting whether someone uses a 3G phone based on their phone usage is
not the same as predicting whether someone will switch from 2G to 3G. However,
for the former task, we believe accurate predictions can be made. The contingency
matrix below gives an accuracy of 87% and a balanced error rate of 14%.
              Predicted 2G   Predicted 3G   Grand Total
Actual 2G     13197          1800           14997
Actual 3G     491            2509           3000
Grand Total   13688          4309           17997
We can obtain higher accuracy by changing the cost function. If we change the reward
for a true positive from 45 to 5, we get the following outcome, essentially trading
accuracy on the 3G class for accuracy on the 2G class.

              Predicted 2G   Predicted 3G   Grand Total
Actual 2G     14459          538            14997
Actual 3G     970            2030           3000
Grand Total   15429          2568           17997

This gives an accuracy of 91.62% and a balanced error rate of 17.96%.
Our entry for the test data predicts nearly half the customers as 3G. This
overweighting towards 3G is a consequence of the competition assessment criterion,
which judges entries on the number of true positives; a more balanced (and hence
more accurate) entry could easily be made.
The ROC curve below demonstrates how this decision boundary changes the
outcome: altering the reward moves the decision boundary, and hence the operating
point on the ROC curve. In the absence of domain expertise, we have chosen an
operating point in an ad hoc fashion based on our interpretation of the competition
objectives.
[Figure: ROC curve for the ensemble on the training data; false positive rate on the x-axis and true positive rate on the y-axis, both running from 0 to 1.]
4. Evaluation phase: discussion of the insights that can be gained from the model
Table 1 shows the ranking of the top 15 attributes by Information Gain, Information
Gain Ratio and Chi-Squared Statistic. Note the importance of Handset model and age
(after preprocessing). We would expect handset to be highly predictive of 3G, given
that most phones are either 3G or 2G.
However, there is variation in 2G/3G with handsets, so presumably some phones can
be used in both contexts. Also of high discriminatory power is handset age, which is
again not surprising, as 3G handsets have not been in production as long.
Subplan is important (although its values are opaque to us), as are the amount spent
and the number of calls made. Perhaps interestingly, the GAMES fields appear high in
the ranking, indicating that game players have a higher likelihood of using 3G services.
We could very simplistically characterize a 3G user as a heavy user with a fairly new
phone who plays games. These are not surprising results (this report should be seen as
a preliminary investigation rather than a complete case study) but it does indicate that
customer modeling could yield user profiles that would help in marketing. In terms of
identifying those most likely to switch, this may not be a particularly valuable insight.
However, targeting those with certain models and older phones could be profitable.
We would tentatively make the following recommendation about switching from 2G
to 3G. 2G customers who have
- demonstrated some preference for games,
- are heavy users, and
- have old phones of predominantly 2G models
should be offered 3G contracts with popular 3G models on popular 3G subplans.
Table 1: Top 15 attributes by Information Gain (IG), Information Gain Ratio (IGR) and Chi-Squared Statistic

Rank  Attr  Attribute        IG      IGR     Chi-Sq
1     14    HS_MODEL         0.1837  0.0960  6165
2     29    HS_AgeGroup      0.1333  0.0616  3834
3     38    HS_AGE           0.1320  0.0501  3866
4     63    AVG_BILL_AMT     0.0760  0.0286  1884
5     7     SUBPLAN          0.0706  0.0358  1918
6     184   STD_VAS_GAMES    0.0635  0.0260  1739
7     112   AVG_VAS_GAMES    0.0632  0.0259  1730
8     72    AVG_NO_CALLED    0.0571  0.0241  1512
9     74    AVG_MINS_OB      0.0566  0.0235  1474
10    89    AVG_MINS_MOB     0.0566  0.0253  1468
11    90    AVG_MINS_INTRAN  0.0547  0.0223  1410
12    84    AVG_MINS_OBPK    0.0542  0.0255  1453
13    75    AVG_CALL_OB      0.0527  0.0225  1406
14    71    AVG_CALL         0.0522  0.0213  1362
15    95    AVG_CALL_LOCAL   0.0518  0.0213  1356
An examination of the least indicative attributes is also of interest. Note that some of
the attributes that one might have thought to be indicative, such as nationality,
delinquency and occupation, are amongst the least important.
Table 2: Bottom 21 attributes by IG, IGR and Chi-Squared Statistic

Rank  Attr  Attribute             IG      IGR     Chi-Sq
187   2     NATIONALITY           0.0008  0.0013  18
188   46    TOT_LAST_DELINQ_DIST  0.0008  0.0028  21
189   110   AVG_VAS_QTUNE         0.0008  0.0072  22
190   182   STD_VAS_QTUNE         0.0008  0.0072  22
191   45    TOT_LAST_DELINQ_DAYS  0.0007  0.0027  20
192   47    TOT_DELINQ_DAYS       0.0007  0.0028  20
193   48    TOT_PAST_DELINQ       0.0007  0.0026  20
194   57    AVG_DELINQ_DAYS       0.0007  0.0026  20
195   9     SUBPLAN_CHANGE_FLAG   0.0007  0.0025  19
196   18    REVPAY_PREV_CD        0.0007  0.0035  21
197   52    TOT_PAST_REVPAY       0.0005  0.0026  13
198   23    VAS_IB_FLAG           0.0003  0.0057  10
199   59    OD_FREQ               0.0003  0.0019  9
200   3     OCCUP_CD              0.0003  0.0005  8
201   17    ID_CHANGE_FLAG        0.0001  0.0035  4
202   10    CONTRACT_FLAG         0.0000  0.0000  0
203   50    TOT_PAST_TOS          0.0000  0.0000  0
204   51    TOT_TOS_DAYS          0.0000  0.0000  0
205   58    OD_REL_SIZE           0.0000  0.0000  0
206   109   AVG_VAS_QG            0.0000  0.0000  0
207   181   STD_VAS_QG            0.0000  0.0000  0
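For reference, rankings of this kind can be reproduced with scikit-learn stand-ins (a sketch under our assumptions: mutual information plays the role of information gain, and the exact scores will differ from the tables above):

    import pandas as pd
    from sklearn.feature_selection import chi2, mutual_info_classif

    # Rank attributes as in Tables 1 and 2. Assumes AllDiscrete.csv holds
    # non-negative integer-coded attributes plus the CUSTOMER_TYPE label.
    data = pd.read_csv("AllDiscrete.csv")
    y = data.pop("CUSTOMER_TYPE")
    ig = pd.Series(mutual_info_classif(data, y, discrete_features=True),
                   index=data.columns).sort_values(ascending=False)
    chi_sq = pd.Series(chi2(data, y)[0],
                       index=data.columns).sort_values(ascending=False)
    print(ig.head(15))      # compare with Table 1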
It is difficult to say much about modelling switching without any information on those
who have switched. However, an analysis of where the variation in the attributes occurs
can at least highlight where customers most differ. The top 5 principal components for
the continuous data are shown below (the five largest loadings of each component).

PC1: -0.146*AVG_MINS - 0.144*AVG_MINS_LOCAL - 0.144*AVG_MINS_OB - 0.144*AVG_MINS_PK - 0.141*AVG_CALL_OB
PC2: 0.241*AVG_BILL_VOICEI + 0.203*AVG_MINS_INT + 0.195*AVG_BILL_VOICE + 0.193*STD_BILL_AMT + 0.191*AVG_BILL_AMT
PC3: 0.226*STD_T1_MINS_CON + 0.213*STD_EXTRANT1_RATIO + 0.201*STD_EXTRAN_RATIO + 0.198*STD_T1_CALL_CON + 0.190*STD_OP_CALL_RATIO
PC4: 0.244*STD_MINS_IB + 0.238*STD_MINS_IBOP - 0.213*AVG_EXTRAN_RATIO + 0.202*AVG_MINS_IBOP + 0.201*STD_MINS_OP
PC5: 0.298*AVG_PAST_OD_VALUE + 0.298*AVG_OD_AMT - 0.298*AVG_PAY_AMT + 0.298*STD_OD_AMT + 0.298*STD_PAY_AMT
It is difficult to interpret these without expert knowledge. However, certain summary
behaviour is apparent and could be useful in modelling customers. The first
component clearly relates to usage, whereas the second describes the amount spent
(including some international contribution). Hence most of the variation between
customers can be explained by the number of calls they make and the amount they
spend. Given the importance of call volumes in predicting 3G, it would be worthwhile
spending some time carefully modelling customers based on usage. The third
component, however, identifies an alternative source of variation in the standard
deviation fields; these fields are beyond our understanding, but may be worthy of
investigation.
Another approach to looking for indicators of switching is to examine the model for
areas of the attribute space where the classifier predicts a mixture of 2G/3G.
However, this is a dangerous activity, as such mixtures are more likely to arise from
modelling error or natural variation. Many of the attributes may be strongly
influenced by whether someone is already using 3G (for example, there may be
more games available on 3G). In our opinion, more detailed customer modelling,
removal of possibly deceptive attributes and a market segmentation using clustering
may highlight areas of the attribute space of particular interest in terms of
switching users.
References
[1] Bagnall, A. J. and Janacek, G. (2005) Clustering time series with clipped data. Machine Learning, 58(2): 151-178.
[2] Bagnall, A. J., Janacek, G. and Powell, M. (2005) A likelihood ratio distance measure for the similarity between the Fourier transform of time series. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[3] Bagnall, A. J., Ratanamahatana, C., Keogh, E., Lonardi, S. and Janacek, G. J. (2006) A bit level representation for time series data mining with shape based similarity. To appear in Data Mining and Knowledge Discovery.
[4] Bagnall, A. J., Whittley, I. M., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006) Variance stabilizing regression ensembles for environmental models. Submitted to the IEEE Congress on Computational Intelligence.
[5] Bull, L., Studley, M., Bagnall, A. J. and Whittley, I. M. (2005) On the use of rule sharing in learning classifier systems. In Proceedings of the 2005 Congress on Evolutionary Computation.
[6] Whittley, I. M., Bagnall, A. J., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006) Attribute selection methods for Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR). To appear in the International Workshop on Feature Selection for Data Mining, part of the 2006 SIAM Data Mining Conference.
[7] SCDM project website: http://www.mc.manchester.ac.uk/scdm/
[8] Large datasets for feature selection: http://www2.cmp.uea.ac.uk/~ajb/SCDM/AttributeSelection.html
[9] CRISP-DM: http://www.crisp-dm.org/index.htm
[10] Zhou, Z.-H. and Yu, Y. (2005) Ensembling local learners through multimodal perturbation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(4): 725-735.
Appendix A
This appendix provides details of the data formatting conducted prior to mining.
2. MARITAL_STATUS:

Train (count by CUSTOMER_TYPE):
MARITAL_STATUS   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
M                7340              1034              8374
S                6834              1750              8584
X                826               216               1042
Grand Total      15000             3000              18000

Test:
MARITAL_STATUS   Total
M                2808
S                2822
X                370
Grand Total      6000
3. NATIONALITY: We transform to just 7 classes. 0 is all others, including missing.

Code    Train   Test
702     16513   5478
458     535     201
0       318     89
156     264     99
608     167     67
356     112     35
360     88      31
Total   17997   6000
4. OCCUP_CD: Combine rare values.

Original, Train (count by CUSTOMER_TYPE):
OCCUP_CD     CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Total
AGT          3                 0                 3
CLRC         29                12                41
CLRF6603C    1                 0                 1
ENG          57                13                70
EXEC         50                16                66
FAC          2                 2                 4
GOVT         6                 2                 8
HWF          40                8                 48
MED          2                 0                 2
MGR          66                18                84
OTH          5015              978               5993
POL          48                20                68
SELF         21                6                 27
SHOP         3                 1                 4
STUD         115               35                150
TCHR         15                2                 17
X            9527              1887              11414
Grand Total  15000             3000              18000

Original, Test:
OCCUP_CD     Total
AGT          4
CLRC         18
ENG          27
EXEC         18
GOVT         4
HWF          8
MED          3
MGR          30
OTH          1960
POL          27
SELF         10
STUD         52
TCHR         5
X            3834
Grand Total  6000

New grouping:
OCCUP_CD     Train   Test
CLRC         41      18
ENG          70      27
EXEC         66      18
HWF          48      8
MGR          84      30
OTH          6032    1976
POL          68      27
SELF         27      10
STUD         150     52
X            11411   3834
5. COBRAND_CARD_FLAG:

Train:
COBRAND_CARD_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                   13696             2587              16283
1                   1304              413               1717
Grand Total         15000             3000              18000

Test:
COBRAND_CARD_FLAG   Total
0                   5405
1                   595
Grand Total         6000
6. HIGHEND_PROGRAM_FLAG:

Train:
HIGHEND_PROGRAM_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                      14499             2555              17054
1                      501               445               946
Grand Total            15000             3000              18000

Test:
HIGHEND_PROGRAM_FLAG   Total
0                      5679
1                      321
Grand Total            6000
7. CUSTOMER_CLASS:

Train:
CUSTOMER_CLASS   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
3                12085             2392              14477
4                398               52                450
5                348               54                402
6                322               26                348
7                1577              379               1956
8                138               34                172
9                121               63                184
10               11                0                 11
Grand Total      15000             3000              18000

Test (per-class counts garbled in the source; values present: 1442, 4, 628, 625, 774, 4627, 440, 567, 20, 2210, 5).
8. SUBPLAN, SUBPLAN_PREVIOUS:
These variables have a very large number of possible values, and there are mismatches
between the test and training sets. We group the low-frequency plans as follows.

SUBPLAN
Class 1: 2219, 2214, 2169, 2164, 2163, 2128, 2127, 2118, 2113
Class 2: 2207, 2204, 2202, 2197, 2196, 2187, 2168, 2159, 2158, 2152, 2130, 2116, 2112, 2109
Missing: 2246, 2244, 2217, 2110

SUBPLAN_PREVIOUS
Class 1: 2112, 2115, 2116, 2130, 2159, 2162, 2164, 2185, 2186, 2187, 2197, 2202, 2207, 2215, 6105, 6106
Class 2: 2113, 2118, 2128, 2152, 2158, 2169, 2170, 2196, 2214, 2219
Missing: 2110, 2127, 2168, 2199, 2244, 2246

(The per-plan train and test frequencies in the source could not be reliably realigned and are omitted.)
9. CONTRACT_FLAG:

Train:
SUBPLAN_CHANGE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                     14266             2795              17061
1                     731               205               936
Grand Total           14997             3000              17997
10. PAY_METD:

PAY_METD      CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
cg            603               132               735
co            1028              328               1356
cs            11294             2113              13407
cx            138               33                171
dd            1236              206               1442
X             698               188               886
Grand Total   14997             3000              17997
PAY_METD_PREV:

PAY_METD_PREV   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
cb              0                 1                 1
cg              421               103               524
ch              9                 1                 10
co              822               255               1077
cs              11946             2262              14208
cx              73                21                94
dd              1028              169               1197
X               698               188               886
Grand Total     14997             3000              17997
LUCKY_NO_FLAG:

LUCKY_NO_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               14296             2694              16990
1               701               306               1007
Grand Total     14997             3000              17997
BLACK_LIST_FLAG:

BLACK_LIST_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                 13883             2921              16804
1                 1114              79                1193
Grand Total       14997             3000              17997
ID_CHANGE_FLAG:

ID_CHANGE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                14940             2981              17921
1                57                19                76
Grand Total      14997             3000              17997
REVPAY_PREV: all values other than -732, -722, -702 and -602 are converted to 0.

Value   Train CUSTOMER_TYPE 0   Train CUSTOMER_TYPE 1   Test
-732    13                      8                       3
-722    120                     42                      52
-702    95                      28                      34
-602    111                     19                      40
0       14658                   2903                    5871
Total   14997                   3000                    6000
COUNTRY (Internationals): for all three country attributes, only the select list of codes below is retained.

Code   Train CUSTOMER_TYPE 0   Train CUSTOMER_TYPE 1   Train Total   Test
0      407                     136                     543           175
29     77                      31                      108           34
35     155                     43                      198           60
39     54                      22                      76            32
48     66                      7                       73            28
50     63                      21                      84            28
52     24                      21                      45            12
65     62                      26                      88            28
69     34                      18                      52            12
70     204                     68                      272           76
80     49                      10                      59            12
101    135                     33                      168           44
102    63                      17                      80            25
103    253                     76                      329           125
105    1449                    531                     1980          701
236    306                     112                     418           121
237    116                     44                      160           53
238    66                      25                      91            26
239    76                      11                      87            32
240    61                      23                      84            20
241    37                      26                      63            14
242    163                     47                      210           78
248    278                     101                     379           131
254    160                     46                      206           83
258    132                     71                      203           64
260    42                      12                      54            12
NONE   10465                   1422                    11887         3973
VAS_CND_FLAG: Note the big class imbalance between train and test.

Train (sum of CUSTOMER_TYPE, so only the 3000 3G customers are counted):
VAS_CND_FLAG   Sum
0              33
1              2967
Grand Total    3000

Test (count):
VAS_CND_FLAG   Total
0              464
1              5536
Grand Total    6000
VAS_CNND:

Train:
VAS_CNND_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               14499             2710              17209
1               498               290               788
Grand Total     14997             3000              17997

Test:
VAS_CNND_FLAG   Total
0               5763
1               237
Grand Total     6000
VAS_DRIVE: Remove.

Train:
VAS_DRIVE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                14992             3000              17992
1                5                 0                 5
Grand Total      14997             3000              17997
VAS_FF: Remove.

Train:
VAS_FF_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14832             2934              17766
1             165               66                231
Grand Total   14997             3000              17997

Test:
VAS_FF_FLAG   Total
0             5909
1             91
Grand Total   6000
VAS_IB:

Train:
VAS_IB_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14905             2966              17871
1             92                34                126
Grand Total   14997             3000              17997

Test:
VAS_IB_FLAG   Total
0             5940
1             60
Grand Total   6000
VAS_NR:

Train:
VAS_NR_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14746             2892              17638
1             251               108               359
Grand Total   14997             3000              17997

Test:
VAS_NR_FLAG   Total
0             5879
1             121
Grand Total   6000
VAS_VM:

Train:
VAS_VM_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             4456              1057              5513
1             10541             1943              12484
Grand Total   14997             3000              17997

Test:
VAS_VM_FLAG   Total
0             1868
1             4132
Grand Total   6000
VAS_VMN: Remove.
VAS_VMP: Remove.

Train:
VAS_VMP_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0              14996             2999              17995
1              1                 1                 2
Grand Total    14997             3000              17997

Test:
VAS_VMP_FLAG   Total
0              5998
1              2
Grand Total    6000
VAS_SN_FLAG: Delete
VAS_GPRS:

Train:
VAS_GPRS_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               631               20                651
1               14366             2980              17346
Grand Total     14997             3000              17997

Test:
VAS_GPRS_FLAG   Total
0               204
1               5796
Grand Total     6000
VAS_CSMS_FLAG: Remove
VAS_IEM_FLAG: Remove
VAS_AR_FLAG:

Train:
VAS_AR_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             10574             1269              11843
1             4423              1731              6154
Grand Total   14997             3000              17997

Test:
VAS_AR_FLAG   Total
0             3993
1             2007
Grand Total   6000
TELE_CHANGE_FLAG: Remove