PAKDD 2006 Data Mining Competition
Institution:
School of Computer Engineering
Nanyang Technological University
Group members:
1. Le Minh Tam
2. Vu Thuy Linh
3. Duong Minh Chau
Problem statement
A telco operator would like to make full use of existing customer usage and demographic
data to identify potential customers for its newly launched 3G mobile communication
network. This is a typical classification problem.
Data sets
1. Overview
Each customer is represented by a record of 251 fields, ranging from demographic
characteristics to mobile usage (in the last 6 months). The target field is
CUSTOMER_TYPE, which identifies the type of customer (i.e. whether he/she is likely to
switch to the new 3G service). 24,000 sample data records are provided. Three quarters
(18,000) of the sample records are given with the target field available, for
training/testing purposes. The rest of the sample is to be used for prediction.
2. Statistics
The ratio of 3G to 2G records in the sample data is 1:5 (i.e. 20,000 2G and 4,000 3G
records). The training portion and testing portion also have the same ratio. We discarded
the SERIAL_NUMBER attribute as it is irrelevant to the classification. After summarizing
the attributes, we got the following statistics:
- 10% of the fields have a value of zero in all training records. These are all of
numeric type.
- Some attributes are related to each other, yet cannot be aggregated, as each of
them has its own meaning. Examples of such related groups are (SUBPLAN,
SUBPLAN_PREVIOUS) and (TOP1_INT_CD, TOP2_INT_CD, TOP3_INT_CD).
These are all categorical fields.
- Most of the numeric attributes have a skewed distribution (right-skewed, in
fact). Around 50% of those have only one peak, at 0.
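A statistic such as the first one can be gathered with a single scan over the training file. The following is a minimal sketch of such a scan in Java (not the exact tool we used; it assumes a comma-separated file with SERIAL_NUMBER in column 0, and a real version would consult the data dictionary for field types):

    import java.io.*;
    import java.util.*;

    // Flag columns whose value is zero in every training record.
    public class ZeroColumnScan {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String[] header = in.readLine().split(",");
            boolean[] allZero = new boolean[header.length];
            Arrays.fill(allZero, true);
            String line;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",", -1);
                for (int c = 1; c < Math.min(cells.length, allZero.length); c++) {
                    try {
                        if (Double.parseDouble(cells[c]) != 0.0) allZero[c] = false;
                    } catch (NumberFormatException e) {
                        allZero[c] = false;   // categorical or missing value
                    }
                }
            }
            in.close();
            for (int c = 1; c < header.length; c++)
                if (allZero[c]) System.out.println(header[c] + " is zero in all records");
        }
    }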
3. Division of training records
We used 3 methods to pick subsets of attributes: pick all attributes (excluding
SERIAL_NUMBER), pick the top-20 attributes with the smallest entropies, and randomly
pick a subset of attributes. The method used to compute entropy is presented later in this
document.
In random attribute selection, we made use of the domain fact that 3G customers are
likely to use more non-voice services such as games, chat, GPRS, WAP, etc. Another
fact taken into consideration was that demographic data such as AGE or GENDER may
have an impact on CUSTOMER_TYPE (e.g. younger people tend to follow technology
trends, and hence may be more likely to switch to the 3G service).
During training, the training data set was divided into 2 sets with ratios of 3:1, 4:1, 9:1,
or 19:1. The records were distributed using the random number generator in Java (seeded
with the machine's timestamp) to guarantee the randomness of the split. During the
distribution, the 3G:2G proportion in both data sets was maintained. Here is the code for
randomly distributing the records into 2 sub-datasets with ratio a:b:
    import java.util.*;

    // Randomly distribute records into two sub-datasets with ratio a:b,
    // preserving the 3G:2G proportion in both.  isPositive[i] is true if
    // record i is a 3G customer; the returned d[i] is 0 or 1, marking the
    // sub-dataset record i belongs to.
    static int[] split(boolean[] isPositive, int a, int b) {
        Random rand = new Random(System.currentTimeMillis());
        List<Integer> v = new ArrayList<Integer>();
        for (int i = 0; i < isPositive.length; i++) v.add(i);
        int[] d = new int[isPositive.length];
        int turnPositive = 0, turnNegative = 0;
        while (!v.isEmpty()) {
            // draw a random record from the remaining pool
            int temp = v.remove(rand.nextInt(v.size()));
            if (isPositive[temp]) {
                // of every (a + b) positive records drawn, the first a go
                // to sub-dataset 0 and the next b to sub-dataset 1
                d[temp] = (turnPositive < a) ? 0 : 1;
                turnPositive = (turnPositive + 1) % (a + b);
            } else {        // 2G record: same round-robin, counted separately
                d[temp] = (turnNegative < a) ? 0 : 1;
                turnNegative = (turnNegative + 1) % (a + b);
            }
        }
        return d;           // use d[] to copy records to sub-datasets 0 and 1
    }
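For example, with a = 3 and b = 1, roughly three quarters of the 3G records and three quarters of the 2G records end up in sub-dataset 0, so both subsets keep the original 1:5 ratio of 3G to 2G.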
The next section presents how we calculate the entropies of the attributes.
4. Calculating Entropy
A. Entropy Formula
The entropy $e_i$ of the $i$-th interval is

$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$$

where $p_{ij} = m_{ij} / m_i$ is the probability of class $j$ in the $i$-th interval. The total
entropy is

$$e = \sum_{i=1}^{n} w_i e_i$$

where $m$ is the total number of values, $w_i = m_i / m$ is the fraction of values in the
$i$-th interval, and $n$ is the number of intervals.
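As a worked instance (taken from the numeric example below), the split at position 80 produces one interval with 0 3G and 3 2G values and another with 3 3G and 4 2G values, giving

$$e = \frac{3}{10} \cdot 0 + \frac{7}{10}\left(-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7}\right) \approx 0.7 \times 0.9852 \approx 0.6897.$$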
B. Implementing Entropy
According to the data dictionary, there are two types of data, categorical and numeric
(also known as nominal and continuous data types). We therefore apply the entropy
formula differently depending on the data type.
For the categorical data:
There is a set of distinct values for the attribute. For each distinct value
(including the missing/null value), we count the number of customers classified as 2G
and as 3G. From these counts we calculate the entropy of each distinct value in the
attribute's value set. To finally get the entropy of the attribute, we do one more step
according to the total entropy formula: we find the fraction of each value in the set (the
number of customers having that value over the total number of customers in the training
file), then apply the formula to combine the per-value entropies into the final entropy of
the categorical attribute.
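A minimal Java sketch of this computation (the method name and the representation of the column as an array of string values are our own, chosen for illustration):

    import java.util.*;

    // Entropy of a categorical attribute: values[i] is the attribute value of
    // record i (null for missing), positive[i] is true if record i is 3G.
    static double categoricalEntropy(String[] values, boolean[] positive) {
        Map<String, int[]> counts = new HashMap<String, int[]>();
        for (int i = 0; i < values.length; i++) {
            String key = (values[i] == null) ? "<missing>" : values[i];
            int[] c = counts.get(key);
            if (c == null) counts.put(key, c = new int[2]);
            c[positive[i] ? 1 : 0]++;          // c[0] = 2G count, c[1] = 3G count
        }
        double total = values.length, e = 0.0;
        for (int[] c : counts.values()) {
            double m = c[0] + c[1];            // number of records with this value
            double ei = 0.0;                   // entropy of this value's interval
            for (int j = 0; j < 2; j++) {
                if (c[j] == 0) continue;       // 0 * log2(0) is taken as 0
                double p = c[j] / m;
                ei -= p * (Math.log(p) / Math.log(2));
            }
            e += (m / total) * ei;             // weight by fraction of records
        }
        return e;
    }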
For the numeric data:
We need to find the position at which to split the data into 2 subsets, whose
entropies are then computed and combined into the final total entropy.
Firstly, we sort all the values to get an ordered array. For example, after sorting, we
have the array:

    60   70   75   85

Candidate split positions are taken below the smallest value, at the midpoint of each
pair of adjacent sorted values, and above the largest value:

    55   65   72   80   87
Using the test conditions ≤ and >, for the first split position 55 we count the number of
3G and 2G records with value ≤ 55 and with value > 55:

           ≤ 55   > 55
    3G      0      3
    2G      0      7
For the next split position, 65, we check the CUSTOMER_TYPE of value 60: if it is
2G, we increase the number of 2G with value ≤ 65 and decrease the number of 2G with
value > 65; if it is 3G, we do the same for the 3G counts. Moving from one split position
to the next in this way, we build the table:
    Sorted values       60        70        75        85
    CUSTOMER_TYPE       2G        2G        2G        3G

    Split position      55        65        72        80        87
                        ≤    >    ≤    >    ≤    >    ≤    >    ≤    >
    3G                  0    3    0    3    0    3    0    3    1    2
    2G                  0    7    1    6    2    5    3    4    3    4
    Entropy             0.8813    0.8265    0.7635    0.6897    0.8755
The split position with the minimum entropy is chosen. In this case, the split position is
80 and the entropy equals 0.6897.
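A minimal Java sketch of this procedure (our own illustrative implementation of the incremental counting described above; it returns the minimum total entropy over all candidate split positions):

    import java.util.*;

    // Best-split entropy of a numeric attribute: values[i] is the attribute
    // value of record i, positive[i] is true if record i is 3G.  Records are
    // visited in sorted order; counts on each side of the split are updated
    // incrementally as the split position sweeps left to right.
    static double bestSplitEntropy(double[] values, boolean[] positive) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (x, y) -> Double.compare(values[x], values[y]));

        // left = counts with value <= split, right = counts with value > split;
        // index 0 = 2G, index 1 = 3G.  Start with the split below all values.
        int[] left = new int[2], right = new int[2];
        for (boolean p : positive) right[p ? 1 : 0]++;

        double best = totalEntropy(left, right, n);
        for (int k = 0; k < n; k++) {          // move value idx[k] to the left side
            int cls = positive[idx[k]] ? 1 : 0;
            left[cls]++;
            right[cls]--;
            best = Math.min(best, totalEntropy(left, right, n));
        }
        return best;
    }

    static double totalEntropy(int[] left, int[] right, int n) {
        return side(left, n) + side(right, n);
    }

    static double side(int[] c, int n) {       // weighted entropy of one side
        double m = c[0] + c[1];
        if (m == 0) return 0.0;
        double e = 0.0;
        for (int j = 0; j < 2; j++) {
            if (c[j] == 0) continue;
            double p = c[j] / m;
            e -= p * (Math.log(p) / Math.log(2));
        }
        return (m / n) * e;                    // weight w_i = m_i / m
    }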
5. Measurements
After picking the attributes and dividing the records into 2 sub-datasets, we used the
larger set for training and the smaller for testing. We measured the performance based on
the precision and recall yielded on the testing subset. We combined the 2 measurements
in the F1-index, computed as

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

The higher the F1-index, the better the performance. From the equation above, we see
that for each test, a balance between precision and recall yields the largest F1-index.
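For instance, the first C4.5 run below has precision 48.83% and recall 2.56%, so $F_1 = 2 \times 0.4883 \times 0.0256 / (0.4883 + 0.0256) \approx 0.0486$, i.e. 4.86%: the relatively high precision cannot compensate for the very low recall.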
Classifiers
Among all available classification algorithms, we used Decision Tree (C4.5) and SVM
(Support Vector Machine). C4.5 is provided in WEKA
(http://www.cs.waikato.ac.nz/ml/weka/) and SVM is provided in SVMlight
(http://svmlight.joachims.org/), both in open-source form. Only some typical sample
results are presented below, although the same process was usually carried out several
times with different random splits.
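For reference, a minimal sketch of how such an experiment can be run through WEKA's Java API (J48 is WEKA's C4.5 implementation; the file names and the assumption that CUSTOMER_TYPE is the last attribute are illustrative, and this is not necessarily how our runs were scripted):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class C45Run {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff");   // larger sub-dataset
            Instances test = DataSource.read("test.arff");     // smaller sub-dataset
            train.setClassIndex(train.numAttributes() - 1);    // CUSTOMER_TYPE last
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();              // WEKA's C4.5 decision tree
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            int pos = 1;                       // index of the 3G class value
            System.out.printf("precision=%.4f recall=%.4f F1=%.4f%n",
                    eval.precision(pos), eval.recall(pos), eval.fMeasure(pos));
        }
    }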
1. Decision tree C4.5
    Subset of attributes                                    Precision   Recall    F1-index
    All 250 attributes                                      48.83%      2.56%     4.86%
    20 entropy-picked attributes                            51.72%      3.5%      6.55%
    20 randomly-picked attributes                           60.66%      45.8%     54.76%
    81 attributes selected based on shape of distribution   52.78%      48.63%    50.62%
According to the results above, we conclude that the decision tree is not a good solution
for this particular problem. Our explanation is that the decision tree is mainly based on
computation and comparison of information gain, with the rest of the decision mostly
based on majority voting. Therefore, given a heavily skewed distribution with a high peak
at zero and only about 2% of the data set lying on the tail, the decision tree algorithm does
not yield a good result. With all attributes picked, or with entropy-picked attributes, we
observe a precision of approximately 50%, while recall is miserably low, under 5%. The
result indicates that the algorithm was only able to classify the few positive records that
lie on the tail, while failing to classify the larger portion that lies in the zero peak (because
there are obviously many more 2G than 3G records in the peak, the majority rule favors
2G). We suspect the reason the 20 randomly-picked attributes gave unexpectedly higher
performance is the randomness of the selection, which helped eliminate some of the
attributes whose distributions are heavily skewed.
We also made an attempt to demonstrate the effect of distribution shape on classification
performance. Out of all the attributes, we selected the 81 whose distribution shapes are
distinguishable (i.e. not possessing a zero peak and a small tail like the rest). The result
was 52.78% precision and 48.63% recall, implying an F1-index of approximately 50%,
which is better than the all-picked and entropy-picked sets, both of which include skewed
fields.
We conclude that the shape of the distribution has a great impact on the performance of
the classification algorithm, as well as on the choice of which algorithm to use.
2. Support Vector Machine (SVM)
Again, only typical sample results are given in this section, although the same test case
may have been carried out several times (including re-randomizing the distribution of the
subsets).
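SVMlight reads training examples in a sparse text format, one record per line: the target (+1 for 3G, -1 for 2G) followed by feature:value pairs for the non-zero features, with indices starting at 1. A small illustrative Java writer (the encoding of records into a dense double array is assumed here, not taken from our actual scripts):

    import java.io.*;

    // Write records in SVMlight's input format: "<target> <index>:<value> ..."
    // with feature indices starting at 1 and zero-valued features omitted.
    static void writeSvmLight(double[][] features, boolean[] positive,
                              String file) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(file));
        for (int i = 0; i < features.length; i++) {
            StringBuilder line = new StringBuilder(positive[i] ? "+1" : "-1");
            for (int f = 0; f < features[i].length; f++)
                if (features[i][f] != 0.0)
                    line.append(' ').append(f + 1).append(':').append(features[i][f]);
            out.println(line);
        }
        out.close();
    }

The model is then trained with svm_learn, whose -j option sets the cost factor by which training errors on positive examples outweigh errors on negative examples; this is the "cost factor" column in the tables below.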
The first time we tested with SVM, we did not include the "multiple codes" categorical
attributes (such as NATIONALITY, OCCUP_CD, etc.). We also made a mistake when
splitting the binary flag attributes, so that missing values were grouped together with
zero values. Following are the results:
    Subset of attributes            Ratio of training   Cost     Precision   Recall    F1-index
                                    vs. testing set     factor
    All 250 attributes              3:1                 2.8      53.93%      53.96%    54%
                                    4:1                 2.5      54.62%      54.86%    55.1%
                                    9:1                 2.59     55.67%      55.67%    55.67%
                                    19:1                2.55     51.37%      50.68%    50%
    20 entropy-picked attributes    3:1                 2.5      49.68%      44.88%    40.93%
                                    9:1                 2.5      48.59%      44.08%    40.33%
    20 randomly-picked attributes   3:1                 2.47     51.33%      51.33%    51.33%
                                    9:1                 2.515    55.03%      54.85%    54.67%
                                    19:1                2.51     53.33%      53.33%    53.33%
We tested with SVM a second time, correcting the mistakes, and the results are as
follows:
    Subset of attributes            Ratio of training   Cost     Precision   Recall    F1-index
                                    vs. testing set     factor
    All 250 attributes              3:1                 3.25     66.4%       66.26%    66.13%
                                    9:1                 3.3      65.22%      65.11%    65%
                                    19:1                3.3      66.67%      66.67%    66.67%
    20 entropy-picked attributes    9:1                 2.4      47.67%      47.67%    47.67%
                                    19:1                2.3      51.3%       51.98%    52.67%
    20 randomly-picked attributes   3:1                 2.6      53.97%      53.24%    52.53%
                                    9:1                 2.6      55.52%      55.42%    55.33%
                                    19:1                2.6      58%         58%       58%
Based on the tables above, it is apparent that SVM yields a better result than the C4.5
decision tree on this particular problem. The F1-index, tested with all attributes picked,
varies from 65% to above 70%, which is far better than the 4.86% F1-index of C4.5.
When we ignored the "multiple codes" fields, the result dropped by approximately
10%. Mistakenly grouping missing values with zero values resulted in inaccurate
modeling of the data.
We could also observe that the all-attributes-picked set always yielded better results than
the rest. Hence, we may conclude that the greater the number of attributes involved, the
better SVM performs.
Another observation worth noting is that SVM's runtime tends to be shorter as the
number of attributes increases: the fewer the attributes picked, the more time it takes to
optimize the model.
We also made several attempts to make use of the AVG and STD pairs of attributes in
the data set. For example, we tried using the 99% confidence interval (AVG ± 3.27 STD)
and the 75% confidence interval (AVG ± 1.5 STD) rather than using AVG and STD
directly. Our idea was that when a pair of AVG and STD is fed into the classification
algorithm, it is treated as 2 separate numeric fields with no relation in between, while in
fact they together describe a distribution. However, the performance yielded was
approximately the same (from 65% to 73%).
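A sketch of this transformation (the interval multipliers are those quoted above; the method name and array layout are our own, for illustration):

    // Replace an (AVG, STD) attribute pair with the bounds of a confidence
    // interval, e.g. k = 1.5 for the 75% interval or k = 3.27 for the 99%
    // interval as used above.  Returns {lower, upper} for each record.
    static double[][] intervalFeatures(double[] avg, double[] std, double k) {
        double[][] bounds = new double[avg.length][2];
        for (int i = 0; i < avg.length; i++) {
            bounds[i][0] = avg[i] - k * std[i];   // lower bound of the interval
            bounds[i][1] = avg[i] + k * std[i];   // upper bound of the interval
        }
        return bounds;
    }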
Therefore, we conclude that SVM performs better on this problem, and that it is able to
select attributes by itself (rather than requiring manual selection of attributes).
The submitted prediction data was generated by SVMlight, with a cost factor of 3.3.