PAKDD 2006 Data Mining Competition

Institution: School of Computer Engineering, Nanyang Technological University

Group members:
1. Le Minh Tam
2. Vu Thuy Linh
3. Duong Minh Chau

Problem statement

A telco operator would like to make full use of existing customer usage and demographic data to identify potential customers for its newly launched 3G mobile communication network. This is a typical classification problem.

Data sets

1. Overview

Each customer is represented by a record of 251 fields, ranging from demographic characteristics to mobile usage (over the last 6 months). The target field is CUSTOMER_TYPE, which identifies the type of customer (i.e. whether he/she is likely to switch to the new 3G service). 24,000 sample records are provided. Three quarters of them (18,000) are given with the target field filled in, for training/testing purposes; the remaining 6,000 records are to be used for prediction.

2. Statistics

The ratio of 3G to 2G records in the sample data is 1:5 (i.e. 20,000 2G and 4,000 3G records). Both the labelled (training/testing) portion and the prediction portion preserve this ratio. We discarded the SERIAL_NUMBER attribute as it is irrelevant to the classification. After summarizing the attributes, we obtained the following statistics:

- About 10% of the fields have a value of zero in all training records; all of these are numeric.
- Some attributes are related to one another, yet cannot be aggregated because each has its own meaning. Examples of such related groups are (SUBPLAN, SUBPLAN_PREVIOUS) and (TOP1_INT_CD, TOP2_INT_CD, TOP3_INT_CD); these are all categorical fields.
- Most of the numeric attributes have a skewed (in fact right-skewed) distribution. Around 50% of them have a single peak at 0.

3. Division of training records

We used 3 methods to pick subsets of attributes: pick all attributes (excluding SERIAL_NUMBER), pick the top 20 attributes with the smallest entropies, and randomly pick a subset of attributes. The method used to compute entropy is presented shortly in this document. In the random attribute selection, we made use of the domain fact that 3G customers are likely to use more non-voice services such as games, chat, GPRS, WAP, etc. Another fact to be considered was that demographic attributes such as AGE or GENDER may have an impact on CUSTOMER_TYPE (e.g. younger people tend to follow technology trends, and hence may be more likely to switch to the 3G service).

During training, the training data set was divided into 2 sets with ratios of 3:1, 4:1, 9:1 and 19:1. The records were distributed using the random number generator in Java (seeded with the machine's timestamp) to guarantee the randomness of the split, and the 3G:2G proportion was maintained in both subsets.
Below is the pseudocode (Java-style) for randomly distributing the records into 2 sub-datasets with ratio a:b:

    Random rand = new Random(System.currentTimeMillis());
    List<Integer> v = new ArrayList<Integer>();
    // fill v with the record indices (SERIAL_NUMBER) of all records
    int[] d = new int[numRecords];      // marks which sub-dataset each record belongs to
    Collections.shuffle(v, rand);       // visit the records in random order
                                        // (replaces the draw-until-hit loop; same effect)
    int turnPositive = 0, turnNegative = 0;
    for (int idx : v) {
        if (isPositive(idx)) {          // 3G (positive) record
            d[idx] = (turnPositive < a) ? 0 : 1;
            turnPositive = (turnPositive + 1) % (a + b);
        } else {                        // 2G (negative) record
            d[idx] = (turnNegative < a) ? 0 : 1;
            turnNegative = (turnNegative + 1) % (a + b);
        }
    }
    // use d[] to copy the records into the corresponding sub-datasets 0 and 1

The next section presents how we calculate the entropies of the attributes.

4. Calculating Entropy

A. Entropy formula

The entropy e_i of the i-th interval is

    e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}

where p_{ij} = m_{ij} / m_i is the probability of class j in the i-th interval and k is the number of classes. The total entropy is

    e = \sum_{i=1}^{n} w_i e_i

where m is the total number of values, w_i = m_i / m is the fraction of values in the i-th interval, and n is the number of intervals.

B. Implementing entropy

According to the data dictionary, there are two types of data: categorical and numeric (also known as nominal and continuous). The entropy formula is therefore applied differently depending on the data type.

For categorical data: each attribute has a set of distinct values. For each distinct value (including the missing/null value), we count the number of customers classified as 2G and as 3G, and from these counts compute the entropy of that value. To obtain the entropy of the whole attribute, we apply the total entropy formula: we find the fraction of each value in the set (the number of customers having that value divided by the total number of customers in the training file) and combine the per-value entropies with these weights.

For numeric data: we need to find the position at which to split the data into 2 subsets, whose entropies are then computed and combined into the final total entropy. First, we sort all the values to obtain an ordered array. For example (10 records in total: 3 3G and 7 2G), suppose the smallest sorted values are:

    60 (2G)   70 (2G)   75 (2G)   85 (3G)

Candidate split positions are taken at the midpoints between adjacent sorted values; for this example the first candidate positions are:

    55   65   72   80   87

Using the test condition "value > 55", we count the records on each side of the first split position: all 3 3G records and all 7 2G records are greater than 55. For the next split position, 65, we check the CUSTOMER_TYPE of the value 60: if it is 2G, we increase the count of 2G records below the split and decrease the count of 2G records above it; if it is 3G, we do the same for the 3G counts. Proceeding one position at a time, we build the following table:

    Split position   3G <=   2G <=   3G >   2G >   Entropy
    55               0       0       3      7      0.8813
    65               0       1       3      6      0.8265
    72               0       2       3      5      0.7635
    80               0       3       3      4      0.6897
    87               1       3       2      4      0.8755

The minimum entropy gives the split position. In this case, the split position is 80, with an entropy of 0.6897.
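As an illustration, the following is a minimal, self-contained Java sketch that reproduces the split-entropy table above using the incremental update just described. The class name and structure are ours for illustration; only the record counts and classes come from the worked example.

    public class SplitEntropy {
        // binary entropy of a (positive, negative) count pair, in bits
        static double entropy(int pos, int neg) {
            int m = pos + neg;
            if (m == 0) return 0.0;
            double e = 0.0;
            for (double p : new double[] { (double) pos / m, (double) neg / m }) {
                if (p > 0) e -= p * Math.log(p) / Math.log(2);  // log base 2
            }
            return e;
        }

        public static void main(String[] args) {
            // 10 records in total: 3 3G (positive) and 7 2G (negative)
            double[] splits = { 55, 65, 72, 80, 87 };
            // classes of the records crossed as the split moves right:
            // 60 (2G), 70 (2G), 75 (2G), 85 (3G)
            boolean[] crossedIs3G = { false, false, false, true };
            int le3G = 0, le2G = 0, gt3G = 3, gt2G = 7;  // at split 55, every record is greater
            for (int i = 0; i < splits.length; i++) {
                if (i > 0) {  // move one record from the ">" side to the "<=" side
                    if (crossedIs3G[i - 1]) { le3G++; gt3G--; } else { le2G++; gt2G--; }
                }
                double m = le3G + le2G + gt3G + gt2G;
                double e = (le3G + le2G) / m * entropy(le3G, le2G)
                         + (gt3G + gt2G) / m * entropy(gt3G, gt2G);
                System.out.printf("split %.0f -> entropy %.4f%n", splits[i], e);
            }
        }
    }

Running the sketch prints the entropies 0.8813, 0.8265, 0.7635, 0.6897 and 0.8755, matching the table above.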
5. Measurements

After picking the attributes and dividing the records into 2 sub-datasets, we used the larger set for training and the smaller one for testing. We measured performance by the precision and recall obtained on the testing subset. The 2 measurements are combined in the F1-index, computed as

    F_1 = \frac{2 \times precision \times recall}{precision + recall}

The higher the F1-index, the better the performance. From the equation above, we can see that, for each test, a balance between precision and recall yields the largest F1-index.

Classifiers

Among the available classification algorithms, we used the C4.5 decision tree and the Support Vector Machine (SVM). C4.5 is provided in WEKA (http://www.cs.waikato.ac.nz/ml/weka/) and SVM is provided in SVMlight (http://svmlight.joachims.org/), both in open-source form. Only some typical sample results are presented below, although the same process was carried out several times with different random splits.

1. Decision tree C4.5

    Subset of attributes                              Precision   Recall   F1-index
    All 250 attributes                                48.83%      2.56%    4.86%
    20 entropy-picked attributes                      51.72%      3.5%     6.55%
    20 randomly-picked attributes                     60.66%      45.8%    54.76%
    81 attributes selected by shape of distribution   52.78%      48.63%   50.62%

According to the results above, we conclude that a decision tree is not a good solution for this particular problem. Our explanation is that a decision tree is mainly based on the computation and comparison of information gain, with the remaining decisions mostly made by majority vote. Therefore, given a heavily skewed distribution with a high peak at zero and only about 2% of the data set lying on the tail, the decision tree algorithm does not yield a good result. With all attributes picked, or with entropy-picked attributes, we observe a precision of approximately 50%, while recall is miserably low, at under 5%. This indicates that the algorithm was only able to classify the few positive records that lie on the tail, while failing on the much larger portion that lies in the zero peak (since there are far more 2G than 3G records in the peak, the majority rule favours 2G).

We suspected that the 20 randomly-picked attributes gave unexpectedly higher performance because the randomness of the selection eliminated some of the attributes whose distributions are heavily skewed. We also attempted to verify the effect of distribution shape on classification performance. Out of the 250 attributes, we selected the 81 whose distribution shapes are distinguishable (i.e. not possessing a zero peak with a small tail like the rest). The result was 52.78% precision and 48.63% recall, implying an F1-index of approximately 50%, better than the all-picked and entropy-picked subsets, which contain skewed fields. We conclude that the shape of the distribution has a great impact on the performance of the classification algorithm, as well as on the choice of which algorithm to use.
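For reference, the following is a minimal sketch of how C4.5 can be run through WEKA's Java API (the J48 class). The ARFF file names, the class attribute being last, and the index of the 3G class value are assumptions made for illustration; this is not our exact test harness.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class C45Runner {
        public static void main(String[] args) throws Exception {
            // hypothetical ARFF exports of the training/testing subsets
            Instances train = new Instances(new BufferedReader(new FileReader("train.arff")));
            Instances test = new Instances(new BufferedReader(new FileReader("test.arff")));
            train.setClassIndex(train.numAttributes() - 1);  // CUSTOMER_TYPE assumed last
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();  // WEKA's implementation of C4.5
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            int positive = 1;  // assumed index of the 3G class value
            System.out.println("Precision: " + eval.precision(positive));
            System.out.println("Recall:    " + eval.recall(positive));
            System.out.println("F1-index:  " + eval.fMeasure(positive));
        }
    }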
2. Support Vector Machine (SVM)

Again, only typical sample results are given in this section, although the same test cases were carried out several times (including repeatedly re-distributing the records into random subsets).

The first time we tested with SVM, we did not include the "multiple codes" categorical attributes (such as NATIONALITY, OCCUP_CD, etc.). We also made a mistake when splitting the binary flag attributes, so that missing values were grouped together with zero values. The results were as follows:

    Subset of attributes            Training:testing ratio   Cost factor   Precision   Recall   F1-index
    All 250 attributes              3:1                      2.8           53.93%      53.96%   54%
                                    4:1                      2.5           54.62%      54.86%   55.1%
                                    9:1                      2.59          55.67%      55.67%   55.67%
                                    19:1                     2.55          51.37%      50.68%   50%
    20 entropy-picked attributes    3:1                      2.5           49.68%      44.88%   40.93%
                                    9:1                      2.5           48.59%      44.08%   40.33%
    20 randomly-picked attributes   3:1                      2.47          51.33%      51.33%   51.33%
                                    9:1                      2.515         55.03%      54.85%   54.67%
                                    19:1                     2.51          53.33%      53.33%   53.33%

We tested with SVM a second time, correcting the mistakes, and the results were as follows:

    Subset of attributes            Training:testing ratio   Cost factor   Precision   Recall   F1-index
    All 250 attributes              3:1                      3.25          66.4%       66.26%   66.13%
                                    9:1                      3.3           65.22%      65.11%   65%
                                    19:1                     3.3           66.67%      66.67%   66.67%
    20 entropy-picked attributes    9:1                      2.4           47.67%      47.67%   47.67%
                                    19:1                     2.3           51.3%       51.98%   52.67%
    20 randomly-picked attributes   3:1                      2.6           53.97%      53.24%   52.53%
                                    9:1                      2.6           55.52%      55.42%   55.33%
                                    19:1                     2.6           58%         58%      58%

Based on the tables above, it is apparent that SVM yields a better result than the C4.5 decision tree on this particular problem. The F1-index with all attributes picked varies from 65% to above 70% across our runs, far better than the 4.86% F1-index of C4.5. When we ignored the "multiple codes" fields, the result was lowered by approximately 10%; mistakenly grouping missing values with zero values also resulted in inaccurate modelling of the data.

We can also observe that the all-attributes-picked set always yielded a better result than the rest. Hence, we may conclude that the greater the number of attributes involved, the better SVM performs. Another observation worth noting is that SVM's runtime tends to be shorter as the number of attributes increases: the fewer the attributes picked, the more time it takes to optimize the model.

We also made several attempts to exploit the AVG and STD pairs of attributes in the data set. For example, we tried using the 99% confidence interval (AVG ± 3.27 STD) or the 75% confidence interval (AVG ± 1.5 STD) rather than using AVG and STD directly. Our idea was that when a pair of AVG and STD values is fed into the classification algorithm, it is treated as 2 separate numeric fields with no relation between them, while in fact they together describe a distribution. However, the performance obtained was approximately the same (from 65% to 73%). We therefore conclude that SVM performs better, and that it is able to select attributes by itself (rather than requiring manual attribute selection).

The submitted prediction data was generated by SVMlight, with a cost factor of 3.3.
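For completeness, the corresponding SVMlight command lines would look roughly like the following; -j is SVMlight's cost-factor option, and the file names are placeholders for our converted data files:

    svm_learn -j 3.3 train.dat model.dat
    svm_classify predict.dat model.dat predictions.dat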