Super Computer Data Mining Project Entry for the 2006 PAKDD Data
Mining Competition: Heterogeneous Classifier Ensemble Approach
A. J. Bagnall(1), G. Cawley(1) and L. Bull(2)
(1) School of Computing Sciences, University of East Anglia, Norwich, UK
(2) School of Computer Science, University of the West of England, Bristol, UK
1. Introduction
This document describes the entry of the Super Computer Data Mining (SCDM)
Project to the PAKDD 2006 data mining competition. The SCDM project is
sponsored by the Engineering and Physical Sciences Research Council of the UK
government (funded under grants GR/T18479/01, GR/T18455/01 and
GR/T/18462/01) and began in January 2005. The objective of the project is to
develop data mining tools, implemented in C++ and MPI, for parallel execution on
super computer architecture. The main super computer facility we use is based at the
University of Manchester and is called CSAR. It forms part of the UK high
performance computing (HPC) service. However, the SCDM code will run on any
cluster and will be freely available to the academic community. The SCDM toolkit
has already contributed to several research projects with six papers published
[1,2,3,4,5,6] and more in preparation. More details can be found at the project website
(still under development) [7]. Our motivation for this project is to develop tools that
can perform complex analysis of very large data with many attributes. We have
assembled several data sets with many attributes and many cases to form a standard test
suite for the assessment of data mining algorithms on this type of data [8]. The main
algorithmic focus of the project is on ensemble techniques, with a particular emphasis
on attribute selection. This competition has been particularly useful for us for several
reasons:
1. The PAKDD competition data set will make a useful addition to the data
collection.
2. It provides a test bed for our implementations of existing algorithms (for this
work k-NN, C4.5, Naïve Bayes, Neural Network and Logistic Regression).
3. It allows us to assess new variations of classifiers (Learning Classifier Systems)
and ensemble algorithms (FASBIR).
2. Approach and Understanding of the Problem
The 2006 PAKDD data mining competition involves building classifiers to predict
whether customers will choose a 3G or 2G phone. Our approach to this type of
problem is normally to adopt a data mining methodology such as the Cross Industry
Standard Process for Data Mining (CRISP-DM) [9]. Our ability to gain a good business
understanding through interaction with the customer is obviously limited due to the
nature of the competition. Nevertheless, a structured approach is always beneficial.
Business Understanding
Our only source of business understanding other than the data itself comes from the
competition website.
“An Asian telco operator which has successfully launched a third generation (3G)
mobile telecommunications network would like to make use of existing customer
usage and demographic data to identify which customers are likely to switch to using
their 3G network.”
There is also some indication of the preferences of the customer in terms of the
classification loss function.
“The data mining task is a classification problem for which the objective is to
accurately predict as many current 3G customers as possible (i.e. true positives) from
the “holdout” sample provided.”
This indicates that the objective may be to target marketing towards potential
customers in a way where the cost per individual is small, hence the priority is to not
miss those most likely to want 3G. It is of course trivial to maximize the true
positives by simply predicting everyone as 3G. However, such a solution is obviously
of no interest. A proper modeling of the situation would involve a cost function for
those both interested and not interested in 3G. The choice of cost function will
obviously affect the ranking of our final entry. Given the stated assessment criteria
“Entries will be assessed in terms of the number of 3G customers that are correctly
classified by their model. A ranking table of the entries will be made available on
this website in April 2006.”
we assume a low cost for false positives and high profit for true positives. Since the
true costs are unknown we also present several different cost scenarios in the analysis
section, based on the training data.
Data Understanding
There are 18000 cases in the training data and 6000 in the test data; each case is an
individual phone user. Of the 250 attributes, 37 are categorical and 213 are numeric.
All apart from the basic demographics (sex, age, nationality and marital status) relate
to phone use. Of the usage attributes, there are average and standard deviations for 90
features. Many of these features are obviously related and correlated (for example,
Average number of off-peak minutes and Average number of off-peak calls), and
domain knowledge could be usefully employed to derive features (such as heavy user
or international traveller). However, it is dangerous and probably fruitless to do so
without consultation with domain experts. Hence we only use automated attribute
transformations.
Data Preparation
The data requires significant pre-processing:
Missing values: In addition to those indicated in the data dictionary, there are several
missing values recorded as "=#VALUE!", suggesting a failed calculation in earlier
preprocessing. We have replaced these with missing value indicators. A further 21
attributes are all zeros; this should also be investigated, and we have removed these
attributes completely.
Mismatched attribute values: several attribute values appear only in the training data
or only in the test data. For all attributes we
- group values that occur in the training data but not the test data as "other", and
- recode values that occur in the test data but not the training data as "missing",
as sketched below.
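This recoding can be expressed as a small pandas sketch (a minimal illustration under our own assumptions; the SCDM toolkit itself is written in C++/MPI, so the function below is hypothetical):

    import pandas as pd

    def align_categories(train: pd.Series, test: pd.Series):
        # A minimal sketch of the recoding described above; the pandas
        # workflow is our assumption, not the SCDM implementation.
        train_levels = set(train.dropna().unique())
        test_levels = set(test.dropna().unique())
        # Levels in the training data but not the test data become "other".
        train = train.where(~train.isin(train_levels - test_levels), "other")
        # Levels in the test data but not the training data become "missing".
        test = test.where(~test.isin(test_levels - train_levels), "missing")
        return train, test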
Clustering discrete attributes with a large number of values: many of the categorical
variables have attribute values with very few observations. We have clustered these
together as dictated by the data; a full description is given in Appendix A. We also
provide class distributions for some of the attributes to give an informal idea of their
discriminatory power. The most important of these are subplan and handset: in the
given format these attributes have too many possible values to be of much use, but
after grouping they prove highly indicative.
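A frequency-threshold version of this clustering might look as follows; the threshold is purely illustrative, since our actual groupings were chosen by inspecting the data (Appendix A):

    import pandas as pd

    def group_rare_levels(column: pd.Series, min_count: int = 30) -> pd.Series:
        # Merge levels with fewer than min_count observations into one
        # class. The cut-off of 30 is an assumption for illustration only.
        counts = column.value_counts()
        rare = counts[counts < min_count].index
        return column.where(~column.isin(rare), "rare")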
Transformation of continuous attributes: many of the continuous attributes have
highly skewed distributions and there is a high degree of multi-collinearity. Given the
lack of domain knowledge, we have taken three approaches to the continuous fields to
reflect these characteristics.
1. Leave them as they are (after the preprocessing described in Appendix A): file
Mixed.csv (Version4_1.csv) contains all the formatted discrete attributes and the
continuous fields as provided.
2. Discretise the continuous attributes with the MDL method: file AllDiscrete.csv
(Version4_2.csv) contains only discrete attributes.
3. Transform into principal components, retaining only the components that explain
95% of the variation in the data (see the sketch below): file MixedPCA.csv
(Version4_4.csv) contains the formatted discrete attributes and the 67 principal
components.
Furthermore, we have derived binary dummy attributes for the discrete fields and thus
created two new files for algorithms that are best suited to problems with only
continuous attributes. These are AllContinuous.csv (Version4_3.csv) and
AllContinuousPCA.csv (Version4_5.csv).
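For approach 3, the PCA step can be sketched with scikit-learn (an assumption for illustration; the SCDM toolkit is a C++/MPI implementation, and whether a class label must first be dropped from the file is not specified here):

    import pandas as pd
    from sklearn.decomposition import PCA

    # Keep the smallest number of principal components that together
    # explain 95% of the variance, as in approach 3 above.
    X = pd.read_csv("AllContinuous.csv")   # assumed to hold predictors only
    pca = PCA(n_components=0.95, svd_solver="full")
    components = pca.fit_transform(X)
    print(components.shape[1])             # the paper reports 67 components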
3. Modelling phase: details of the classification model that was produced
Our approach is to produce a probability estimate using an ensemble of classifiers,
some of which may themselves be ensembles. The core classifiers used are:
1. Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR)
[10] is an ensemble k nearest neighbour algorithm that filters attributes by information
gain and injects randomness through the choice of distance metrics and attribute
subsets. We use the parameter values and implementation of FASBIR described in [6];
the final output of FASBIR is the result of an ensemble of 100 alternative k-NN
classifiers (a loose approximation is sketched after this list). FASBIR is run on
AllContinuous.csv and AllContinuousPCA.csv.
2. C4.5 decision tree. Our C4.5 is a standard implementation comparable to the
WEKA version. Based on past experience, we set the minimum leaf node size of C4.5
to 50; this is a crude but effective way of avoiding overfitting. C4.5 is run on
Mixed.csv and AllDiscrete.csv.
3. Naïve Bayes. A standard implementation assuming normality for real valued
attributes. NB is run on AllDiscrete.csv and MixedPCA.csv.
4. Logistic regression. Parameters are estimated with a standard maximum likelihood
technique. It is run on AllContinuous.csv and AllContinuousPCA.csv.
5. Neural network. The NN has a single hidden layer, initially containing 32 or 64
neurons. The output layer uses a softmax activation function with a cross-entropy
error metric and a standard 1-of-c encoding. A Bayesian regularisation scheme with a
Laplace prior is used to avoid overfitting. It is run on AllContinuous.csv and
AllContinuousPCA.csv.
6. Learning Classifier System. The LCS is as described in [5], using an error weighted
fitness function, a niched genetic algorithm and a Q-Learning style Widrow-Hoff
update rule. It is run on AllDiscrete.csv.
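As a rough illustration of the FASBIR idea in item 1 (the full algorithm, with information-gain filtering and perturbed distance metrics, is described in [6,10]), a bagged k-NN over random attribute subsets can be sketched with scikit-learn; every parameter value below except the ensemble size of 100 is our assumption:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Bagged k-NN with random attribute subspaces: a loose stand-in for
    # FASBIR, which additionally filters attributes by information gain
    # and injects randomness into the distance metric [6,10].
    fasbir_like = BaggingClassifier(
        KNeighborsClassifier(n_neighbors=5),  # k is an assumption
        n_estimators=100,  # the paper's FASBIR ensemble has 100 k-NN members
        max_features=0.5,  # attribute subspace fraction is an assumption
        bootstrap=True,
    )
    # Usage: fasbir_like.fit(X_train, y_train); fasbir_like.predict_proba(X_test)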
This gives us 11 classifiers in the meta-ensemble. Each of the 11 classifiers produces
a probability estimate for the training cases, and we use a weighted mean of these
probability estimates as our final prediction. The weights are determined by
estimating the accuracy of each classifier under 10-fold cross-validation. Once a
probability estimate is obtained for each case, a prediction is made using the
following profit matrix; a sketch of the resulting decision rule follows the matrix.
Profit Matrix

                          Actual 3G (true)   Actual 2G (false)
Predicted 3G (positive)   45 (TP)            -5 (FP)
Predicted 2G (negative)   0 (FN)             0 (TN)
This is designed to simulate the idea of making marketing decisions based on the
classifier: the cost of classifying a customer as 3G is associated with the cost of a mail
shot. If the individual is classified as 3G (positive) we assume a mail shot costing 5
units is sent, and if not, no action is taken (at zero cost). We assume that if the mail
shot reaches a 3G customer they will respond (a true positive), yielding a profit of 50
and hence a net return of 45, whereas if they do not respond (a false positive) there is
a loss of 5. Thus, to decide whether to classify a case as 3G, we take the action that
maximizes the expected return. Of course, this may not be the intended use of the
analysis, but a simple decision-theoretic framework at least provides some structure to
justify the classifications we make.
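Combining the weighted mean with this profit matrix, the decision rule reduces to a probability threshold: predict 3G whenever the expected return of mailing, 45p - 5(1-p), exceeds the zero return of doing nothing, i.e. whenever p > 0.1. A sketch (the array names and shapes are hypothetical):

    import numpy as np

    def predict_3g(probs: np.ndarray, cv_acc: np.ndarray) -> np.ndarray:
        # probs: (n_cases, 11) array of P(3G) estimates, one column per
        # classifier; cv_acc: (11,) 10-fold cross-validation accuracies.
        weights = cv_acc / cv_acc.sum()   # accuracy-weighted ensemble
        p = probs @ weights               # weighted mean P(3G)
        # Mail (predict 3G) when the expected profit beats doing nothing:
        # 45*p - 5*(1-p) > 0  <=>  p > 0.1.
        return (45 * p - 5 * (1 - p)) > 0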
Training Results
The training data results indicate the resulting classifier is effective, although it may
be based on information that is not of great use in predicting whether individuals will
switch. Predicting whether someone uses a 3G phone based on their phone usage is
not the same as predicting whether someone will switch from 2G to 3G. However,
for the former task, we believe accurate predictions can be made. The contingency
matrix below gives an accuracy of 87% and a balanced error rate of 14%.
              Predicted 2G   Predicted 3G   Grand Total
Actual 2G     13197          1800           14997
Actual 3G     491            2509           3000
Grand Total   13688          4309           17997
We can obtain higher accuracy by changing the cost function. If we change the reward
for a true positive from 45 to 5, we get the following outcome, essentially trading
accuracy on the 3G class for accuracy on the 2G class.

              Predicted 2G   Predicted 3G   Grand Total
Actual 2G     14459          538            14997
Actual 3G     970            2030           3000
Grand Total   15429          2568           17997

This gives an accuracy of 91.62% and a balanced error rate of 17.96%.
Our entry for the test data predicts nearly half the customers as 3G. This
overweighting towards 3G is a consequence of the competition assessment criterion,
which judges entries on the number of true positives; a more balanced (and hence
more accurate) entry could easily be made.
The ROC curve below demonstrates how this decision boundary changes the
outcome: altering the reward moves the decision boundary, and hence the operating
point on the ROC curve. In the absence of domain expertise, we have chosen an
operating point in an ad hoc fashion based on our interpretation of the competition
objectives.
[Figure: ROC curve for the ensemble on the training data; false positive rate on the x-axis and true positive rate on the y-axis, both running from 0 to 1.]
4. Evaluation phase: discussion of the insights that can be gained from the model
Table 1 shows the ranking of the top 15 attributes by Information Gain, Information
Gain Ratio and Chi-Squared Statistic. Note the importance of Handset model and age
(after preprocessing). We would expect handset to be highly predictive of 3G, given
that most phones are either 3G or 2G.
However, there is variation in 2G/3G with handsets, so presumably some phones can
be used in both contexts. Also of high discriminatory power is handset age, which is
again not surprising, as 3G handsets have not been in production as long.
Subplan is important (although its values are opaque to us), as are the amount spent
and the number of calls made. Perhaps interestingly, the GAMES fields appear high in
the ranking, indicating that game players have a higher likelihood of using 3G services.
We could very simplistically characterize a 3G user as a heavy user with a fairly new
phone who plays games. These are not surprising results (this report should be seen as
a preliminary investigation rather than a complete case study) but it does indicate that
customer modeling could yield user profiles that would help in marketing. In terms of
identifying those most likely to switch, this may not be a particularly valuable insight.
However, targeting those with certain models and older phones could be profitable.
We would tentatively make the following recommendation about switching from 2G
to 3G. 2G customers who have
- demonstrated some preference for games,
- are heavy users, and
- have old phones of predominantly 2G models
should be offered 3G contracts with popular 3G models on popular 3G subplans.
Table 1: Top 15 attributes by Information Gain (IG), Information Gain Ratio (IGR) and Chi-Squared Statistic

Rank  Attr  Attribute        IG      IGR     Chi-Sq
1     14    HS_MODEL         0.1837  0.0960  6165
2     29    HS_AgeGroup      0.1333  0.0616  3834
3     38    HS_AGE           0.1320  0.0501  3866
4     63    AVG_BILL_AMT     0.0760  0.0286  1884
5     7     SUBPLAN          0.0706  0.0358  1918
6     184   STD_VAS_GAMES    0.0635  0.0260  1739
7     112   AVG_VAS_GAMES    0.0632  0.0259  1730
8     72    AVG_NO_CALLED    0.0571  0.0241  1512
9     74    AVG_MINS_OB      0.0566  0.0235  1474
10    89    AVG_MINS_MOB     0.0566  0.0253  1468
11    90    AVG_MINS_INTRAN  0.0547  0.0223  1410
12    84    AVG_MINS_OBPK    0.0542  0.0255  1453
13    75    AVG_CALL_OB      0.0527  0.0225  1406
14    71    AVG_CALL         0.0522  0.0213  1362
15    95    AVG_CALL_LOCAL   0.0518  0.0213  1356
An examination of the least indicative attributes is also of interest. Note that some of
the attributes that one might have thought to be indicative, such as nationality,
delinquency and occupation, are amongst the least important.
Table 2: Bottom 21 attributes by IG, IGR and Chi-Squared Statistic

Rank  Attr  Attribute             IG      IGR     Chi-Sq
187   2     NATIONALITY           0.0008  0.0013  18
188   46    TOT_LAST_DELINQ_DIST  0.0008  0.0028  21
189   110   AVG_VAS_QTUNE         0.0008  0.0072  22
190   182   STD_VAS_QTUNE         0.0008  0.0072  22
191   45    TOT_LAST_DELINQ_DAYS  0.0007  0.0027  20
192   47    TOT_DELINQ_DAYS       0.0007  0.0028  20
193   48    TOT_PAST_DELINQ       0.0007  0.0026  20
194   57    AVG_DELINQ_DAYS       0.0007  0.0026  20
195   9     SUBPLAN_CHANGE_FLAG   0.0007  0.0025  19
196   18    REVPAY_PREV_CD        0.0007  0.0035  21
197   52    TOT_PAST_REVPAY       0.0005  0.0026  13
198   23    VAS_IB_FLAG           0.0003  0.0057  10
199   59    OD_FREQ               0.0003  0.0019  9
200   3     OCCUP_CD              0.0003  0.0005  8
201   17    ID_CHANGE_FLAG        0.0001  0.0035  4
202   10    CONTRACT_FLAG         0.0000  0.0000  0
203   50    TOT_PAST_TOS          0.0000  0.0000  0
204   51    TOT_TOS_DAYS          0.0000  0.0000  0
205   58    OD_REL_SIZE           0.0000  0.0000  0
206   109   AVG_VAS_QG            0.0000  0.0000  0
207   181   STD_VAS_QG            0.0000  0.0000  0
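For reference, rankings of this kind can be reproduced with scikit-learn stand-ins (a sketch under our assumptions: mutual information plays the role of information gain, and the exact scores will differ from the tables above):

    import pandas as pd
    from sklearn.feature_selection import chi2, mutual_info_classif

    # Rank attributes as in Tables 1 and 2. Assumes AllDiscrete.csv holds
    # non-negative integer-coded attributes plus the CUSTOMER_TYPE label.
    data = pd.read_csv("AllDiscrete.csv")
    y = data.pop("CUSTOMER_TYPE")
    ig = pd.Series(mutual_info_classif(data, y, discrete_features=True),
                   index=data.columns).sort_values(ascending=False)
    chi_sq = pd.Series(chi2(data, y)[0],
                       index=data.columns).sort_values(ascending=False)
    print(ig.head(15))      # compare with Table 1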
It is difficult to say much about modelling switching without any information on those
who have switched. However, an analysis of where the variation in the attributes occurs
can at least highlight where customers most differ. The top 5 principal components for
the continuous data are shown below (the five largest loadings of each component).

PC1: -0.146*AVG_MINS - 0.144*AVG_MINS_LOCAL - 0.144*AVG_MINS_OB - 0.144*AVG_MINS_PK - 0.141*AVG_CALL_OB
PC2: 0.241*AVG_BILL_VOICEI + 0.203*AVG_MINS_INT + 0.195*AVG_BILL_VOICE + 0.193*STD_BILL_AMT + 0.191*AVG_BILL_AMT
PC3: 0.226*STD_T1_MINS_CON + 0.213*STD_EXTRANT1_RATIO + 0.201*STD_EXTRAN_RATIO + 0.198*STD_T1_CALL_CON + 0.190*STD_OP_CALL_RATIO
PC4: 0.244*STD_MINS_IB + 0.238*STD_MINS_IBOP - 0.213*AVG_EXTRAN_RATIO + 0.202*AVG_MINS_IBOP + 0.201*STD_MINS_OP
PC5: 0.298*AVG_PAST_OD_VALUE + 0.298*AVG_OD_AMT - 0.298*AVG_PAY_AMT + 0.298*STD_OD_AMT + 0.298*STD_PAY_AMT
It is difficult to interpret these without expert knowledge. However, certain summary
behaviour is apparent and could be useful in modelling customers. The first
component clearly relates to usage, whereas the second describes the amount spent
(including some international contribution). Hence most of the variation between
customers can be explained by the number of calls they make and the amount they
spend. Given the importance of call volumes in predicting 3G, it would be worthwhile
spending some time carefully modelling customers based on usage. The third
component, however, identifies an alternative source of variation in the standard
deviation fields; these fields are beyond our understanding, but may be worthy of
investigation.
Another approach to looking for indicators of switching is to examine the model for
areas of the attribute space where the classifier predicts a mixture of 2G/3G.
However, this is a dangerous activity, as such mixtures are more likely to arise from
modelling error or natural variation. Many of the attributes may be strongly
influenced by whether someone is already using 3G (for example, there may be
more games available on 3G). In our opinion, more detailed customer modelling,
removal of possibly deceptive attributes and a market segmentation using clustering
may highlight areas of the attribute space of particular interest in terms of
switching users.
References
[1] Bagnall, A. J. and Janacek, G. (2005) Clustering time series with clipped data. Machine Learning, 58(2): 151-178.
[2] Bagnall, A. J., Janacek, G. and Powell, M. (2005) A likelihood ratio distance measure for the similarity between the Fourier transform of time series. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[3] Bagnall, A. J., Ratanamahatana, C., Keogh, E., Lonardi, S. and Janacek, G. J. (2006) A bit level representation for time series data mining with shape based similarity. To appear in Data Mining and Knowledge Discovery.
[4] Bagnall, A. J., Whittley, I. M., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006) Variance stabilizing regression ensembles for environmental models. Submitted to the IEEE Congress on Computational Intelligence.
[5] Bull, L., Studley, M., Bagnall, A. J. and Whittley, I. M. (2005) On the use of rule sharing in learning classifier systems. In Proceedings of the 2005 Congress on Evolutionary Computation.
[6] Whittley, I. M., Bagnall, A. J., Bull, L., Pettipher, M., Studley, M. and Tekiner, F. (2006) Attribute selection methods for Filtered Attribute Subspace based Bagging with Injected Randomness (FASBIR). To appear in the International Workshop on Feature Selection for Data Mining, part of the 2006 SIAM Data Mining Conference.
[7] SCDM project website: http://www.mc.manchester.ac.uk/scdm/
[8] Large datasets for feature selection: http://www2.cmp.uea.ac.uk/~ajb/SCDM/AttributeSelection.html
[9] CRISP-DM: http://www.crisp-dm.org/index.htm
[10] Zhou, Z.-H. and Yu, Y. (2005) Ensembling local learners through multimodal perturbation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(4): 725-735.
Appendix A
This appendix provides details of the data formatting conducted prior to mining.
2. MARITAL_STATUS:

Train (count by CUSTOMER_TYPE):
MARITAL_STATUS   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
M                7340              1034              8374
S                6834              1750              8584
X                826               216               1042
Grand Total      15000             3000              18000

Test:
MARITAL_STATUS   Total
M                2808
S                2822
X                370
Grand Total      6000
3. NATIONALITY: We transform to just 7 classes. 0 is all others, including missing.

Code    Train   Test
702     16513   5478
458     535     201
0       318     89
156     264     99
608     167     67
356     112     35
360     88      31
Total   17997   6000
4. OCCUP_CD: Combine rare values.

Original, Train (count by CUSTOMER_TYPE):
OCCUP_CD     CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Total
AGT          3                 0                 3
CLRC         29                12                41
CLRF6603C    1                 0                 1
ENG          57                13                70
EXEC         50                16                66
FAC          2                 2                 4
GOVT         6                 2                 8
HWF          40                8                 48
MED          2                 0                 2
MGR          66                18                84
OTH          5015              978               5993
POL          48                20                68
SELF         21                6                 27
SHOP         3                 1                 4
STUD         115               35                150
TCHR         15                2                 17
X            9527              1887              11414
Grand Total  15000             3000              18000

Original, Test:
OCCUP_CD     Total
AGT          4
CLRC         18
ENG          27
EXEC         18
GOVT         4
HWF          8
MED          3
MGR          30
OTH          1960
POL          27
SELF         10
STUD         52
TCHR         5
X            3834
Grand Total  6000

New grouping:
OCCUP_CD     Train   Test
CLRC         41      18
ENG          70      27
EXEC         66      18
HWF          48      8
MGR          84      30
OTH          6032    1976
POL          68      27
SELF         27      10
STUD         150     52
X            11411   3834
5. COBRAND_CARD_FLAG:

Train:
COBRAND_CARD_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                   13696             2587              16283
1                   1304              413               1717
Grand Total         15000             3000              18000

Test:
COBRAND_CARD_FLAG   Total
0                   5405
1                   595
Grand Total         6000
6. HIGHEND_PROGRAM_FLAG:

Train:
HIGHEND_PROGRAM_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                      14499             2555              17054
1                      501               445               946
Grand Total            15000             3000              18000

Test:
HIGHEND_PROGRAM_FLAG   Total
0                      5679
1                      321
Grand Total            6000
7. CUSTOMER_CLASS:

Train:
CUSTOMER_CLASS   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
3                12085             2392              14477
4                398               52                450
5                348               54                402
6                322               26                348
7                1577              379               1956
8                138               34                172
9                121               63                184
10               11                0                 11
Grand Total      15000             3000              18000

Test (per-class counts garbled in the source; values present: 1442, 4, 628, 625, 774, 4627, 440, 567, 20, 2210, 5).
8. SUBPLAN, SUBPLAN_PREVIOUS:
These variables have a very large number of possible values, and there are mismatches
between the test and training sets. We group the low-frequency plans as follows.

SUBPLAN
Class 1: 2219, 2214, 2169, 2164, 2163, 2128, 2127, 2118, 2113
Class 2: 2207, 2204, 2202, 2197, 2196, 2187, 2168, 2159, 2158, 2152, 2130, 2116, 2112, 2109
Missing: 2246, 2244, 2217, 2110

SUBPLAN_PREVIOUS
Class 1: 2112, 2115, 2116, 2130, 2159, 2162, 2164, 2185, 2186, 2187, 2197, 2202, 2207, 2215, 6105, 6106
Class 2: 2113, 2118, 2128, 2152, 2158, 2169, 2170, 2196, 2214, 2219
Missing: 2110, 2127, 2168, 2199, 2244, 2246

(The per-plan train and test frequencies in the source could not be reliably realigned and are omitted.)
9. CONTRACT_FLAG:

Train:
SUBPLAN_CHANGE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                     14266             2795              17061
1                     731               205               936
Grand Total           14997             3000              17997
10. PAY_METD:

PAY_METD      CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
cg            603               132               735
co            1028              328               1356
cs            11294             2113              13407
cx            138               33                171
dd            1236              206               1442
X             698               188               886
Grand Total   14997             3000              17997
PAY_METD_PREV:

PAY_METD_PREV   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
cb              0                 1                 1
cg              421               103               524
ch              9                 1                 10
co              822               255               1077
cs              11946             2262              14208
cx              73                21                94
dd              1028              169               1197
X               698               188               886
Grand Total     14997             3000              17997
LUCKY_NO_FLAG:

LUCKY_NO_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               14296             2694              16990
1               701               306               1007
Grand Total     14997             3000              17997
BLACK_LIST_FLAG:

BLACK_LIST_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                 13883             2921              16804
1                 1114              79                1193
Grand Total       14997             3000              17997
ID_CHANGE_FLAG:

ID_CHANGE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                14940             2981              17921
1                57                19                76
Grand Total      14997             3000              17997
REVPAY_PREV: all values other than -732, -722, -702 and -602 are converted to 0.

Value   Train CUSTOMER_TYPE 0   Train CUSTOMER_TYPE 1   Test
-732    13                      8                       3
-722    120                     42                      52
-702    95                      28                      34
-602    111                     19                      40
0       14658                   2903                    5871
Total   14997                   3000                    6000
COUNTRY (Internationals): for all three country attributes, only the select list of codes below is retained.

Code   Train CUSTOMER_TYPE 0   Train CUSTOMER_TYPE 1   Train Total   Test
0      407                     136                     543           175
29     77                      31                      108           34
35     155                     43                      198           60
39     54                      22                      76            32
48     66                      7                       73            28
50     63                      21                      84            28
52     24                      21                      45            12
65     62                      26                      88            28
69     34                      18                      52            12
70     204                     68                      272           76
80     49                      10                      59            12
101    135                     33                      168           44
102    63                      17                      80            25
103    253                     76                      329           125
105    1449                    531                     1980          701
236    306                     112                     418           121
237    116                     44                      160           53
238    66                      25                      91            26
239    76                      11                      87            32
240    61                      23                      84            20
241    37                      26                      63            14
242    163                     47                      210           78
248    278                     101                     379           131
254    160                     46                      206           83
258    132                     71                      203           64
260    42                      12                      54            12
NONE   10465                   1422                    11887         3973
VAS_CND_FLAG: Note the big class imbalance between train and test.

Train (sum of CUSTOMER_TYPE, so only the 3000 3G customers are counted):
VAS_CND_FLAG   Sum
0              33
1              2967
Grand Total    3000

Test (count):
VAS_CND_FLAG   Total
0              464
1              5536
Grand Total    6000
VAS_CNND:

Train:
VAS_CNND_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               14499             2710              17209
1               498               290               788
Grand Total     14997             3000              17997

Test:
VAS_CNND_FLAG   Total
0               5763
1               237
Grand Total     6000
VAS_DRIVE: Remove.

Train:
VAS_DRIVE_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0                14992             3000              17992
1                5                 0                 5
Grand Total      14997             3000              17997
VAS_FF: Remove.

Train:
VAS_FF_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14832             2934              17766
1             165               66                231
Grand Total   14997             3000              17997

Test:
VAS_FF_FLAG   Total
0             5909
1             91
Grand Total   6000
VAS_IB:

Train:
VAS_IB_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14905             2966              17871
1             92                34                126
Grand Total   14997             3000              17997

Test:
VAS_IB_FLAG   Total
0             5940
1             60
Grand Total   6000
VAS_NR:

Train:
VAS_NR_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             14746             2892              17638
1             251               108               359
Grand Total   14997             3000              17997

Test:
VAS_NR_FLAG   Total
0             5879
1             121
Grand Total   6000
VAS_VM:

Train:
VAS_VM_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             4456              1057              5513
1             10541             1943              12484
Grand Total   14997             3000              17997

Test:
VAS_VM_FLAG   Total
0             1868
1             4132
Grand Total   6000
VAS_VMN: Remove.
VAS_VMP: Remove.

Train:
VAS_VMP_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0              14996             2999              17995
1              1                 1                 2
Grand Total    14997             3000              17997

Test:
VAS_VMP_FLAG   Total
0              5998
1              2
Grand Total    6000
VAS_SN_FLAG: Delete
VAS_GPRS:

Train:
VAS_GPRS_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0               631               20                651
1               14366             2980              17346
Grand Total     14997             3000              17997

Test:
VAS_GPRS_FLAG   Total
0               204
1               5796
Grand Total     6000
VAS_CSMS_FLAG: Remove
VAS_IEM_FLAG: Remove
VAS_AR_FLAG:

Train:
VAS_AR_FLAG   CUSTOMER_TYPE 0   CUSTOMER_TYPE 1   Grand Total
0             10574             1269              11843
1             4423              1731              6154
Grand Total   14997             3000              17997

Test:
VAS_AR_FLAG   Total
0             3993
1             2007
Grand Total   6000
TELE_CHANGE_FLAG: Remove