PAKDD 2006 Data Mining Competition Write-Up

Participant Name: Nguyen Hoang Anh

Problem Summary

An Asian telco operator which has successfully launched a third-generation (3G) mobile telecommunications network would like to make use of existing customer usage and demographic data to identify which customers are likely to switch to using its 3G network. An original sample dataset of 20,000 2G network customers and 4,000 3G network customers has been provided, with more than 200 data fields. The target categorical variable is "Customer_Type" (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G-network-compatible mobile phone.

Three-quarters of the dataset (15K 2G, 3K 3G) have the target field available and are meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) will be made available with the target field missing and is meant to be used for prediction. The data mining task is a classification problem for which the objective is to accurately predict as many current 3G customers as possible (i.e. true positives) from the "holdout" sample provided.

Understanding of the problem:

As the problem statement says, the data mining task is a classification problem whose objective is to predict as many current 3G customers as possible. This prediction is done by a model generated from the 18,000 customer records that are already labeled. The problem becomes easier once we know that the prediction data contain 5,000 2G customers and 1,000 3G customers: we can tune the settings of the algorithm so that it produces the best-predicted 1,000 3G customers.

In real life, this prediction task can be used for marketing purposes. If the company knows which customers are likely to switch to 3G, it can devise better marketing strategies for these targeted customers.
That is why it is better to misclassify a 2G customer as a 3G customer than to misclassify a 3G customer as a 2G customer.

Approaching the problem:

Support Vector Machines (SVMs) were used for this classification task. The following are the general ideas of the algorithm.

1. Introduction to SVMs:

Support Vector Machines were developed by Vapnik in 1995 based on the Structural Risk Minimization principle from statistical learning theory. The idea of structural risk minimization is to find a hypothesis h from a hypothesis space H for which one can guarantee the lowest probability of error Err(h) for a given set of training examples S = (x1,y1), ..., (xn,yn), with xi ∈ R^N and yi ∈ {-1,+1}.

For simplicity, let us assume that the training data can be plotted in a plane and can be separated by at least one hyperplane h'.

[Figure: a separating hyperplane with margin δ; the training examples at distance exactly δ from the hyperplane are the support vectors, and the optimal hyperplane is the one that maximizes δ.]

This means that there is a weight vector w' and a threshold b' such that all positive training examples are on one side of the hyperplane while all negative training examples lie on the other side. This is equivalent to requiring

    yi (w'·xi + b') > 0

for each training example (xi, yi). In other words, the equation of the hyperplane which performs this separation is

    w·x + b = 0, so that
    w·xi + b ≥ 0 for yi = +1
    w·xi + b < 0 for yi = -1

In general, there can be multiple hyperplanes that separate the training data without error. From these hyperplanes, the Support Vector Machine chooses the Optimal Hyperplane, the one with the largest margin δ, as illustrated in the figure. The margin δ is the distance from the hyperplane to the closest training examples. For each training set, there is only one hyperplane with maximum margin. The examples closest to the hyperplane are called Support Vectors; they lie at a distance of exactly δ.

2.
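The decision rule and margin described above can be sketched in a few lines of Python. This is illustrative only; the weight vector, threshold, and sample points below are made-up values, not taken from the write-up:

```python
# Sketch of the linear decision rule and geometric margin described above.
# w, b, and the example points are hypothetical illustrative values.

def decide(w, x, b):
    """Return +1 or -1 according to the sign of w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin_distance(w, x, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(score) / norm

w, b = [1.0, -1.0], 0.0           # hypothetical separating hyperplane
print(decide(w, [2.0, 0.5], b))   # point on the positive side -> 1
print(decide(w, [0.5, 2.0], b))   # point on the negative side -> -1
```

The margin δ of the hyperplane is then the smallest `margin_distance` over all training examples, and the support vectors are the points that attain it.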
SVM Light:

Although there are many implementations of Support Vector Machines (SVMs) available, SVM Light, an implementation of SVMs in C, is among the most widely used for its accuracy. SVM Light was used as the basic binary classifier for this classification task. SVM Light can be downloaded from: http://svmlight.joachims.org/

Full technical details of algorithm(s) used

The training and testing data were provided in an Excel sheet with more than 250 data fields. Each of these fields was represented by a feature number, and the feature value represented the value of that field.

1. Data cleaning and relevance analysis:

Data cleaning refers to the preprocessing of data in order to remove or reduce noise. As not all of the 250 fields are useful or relevant, removing some of them helps reduce the number of dimensions for the SVM. After inspecting the data, the following fields were removed as they appeared to have little effect:

a) Nationality: most customers are from the same country (702).
b) OCCUP_CD: most of the data are missing.
c) SubPlan_Previous: the author decided to remove this field because there is already a SubPlan_Change_Flag and most customers do not change their plan.
d) NUM_DELINQ_TEL, PAY_METD, PAY_METD_PREV, PAY_METD_CHG, and HS_CHANGE: data are not useful or relevant.
e) HS_MANUFACTURER: removed as there is already a handset model field.
f) BLACK_LIST_FLAG, TELE_CHANGE_FLAG, and ID_CHANGE_FLAG: the values are nearly constant across all records.

2. Transforming:

a) Input to SVMs: As input to SVMs must be in numeric form, all the data need to be transformed. Each data field was represented by a feature number, and the feature value represented the value of that field.
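The field-removal step above can be sketched as follows. This assumes the Excel data have been exported and read as per-row dictionaries (for example via `csv.DictReader`); the field names come from the write-up, while the sample row is illustrative:

```python
# Sketch of the data-cleaning step: drop the fields the write-up marks
# as irrelevant. Field names are taken from the list above; the sample
# record is made up for illustration.

DROPPED_FIELDS = {
    "Nationality", "OCCUP_CD", "SubPlan_Previous",
    "NUM_DELINQ_TEL", "PAY_METD", "PAY_METD_PREV", "PAY_METD_CHG",
    "HS_CHANGE", "HS_MANUFACTURER",
    "BLACK_LIST_FLAG", "TELE_CHANGE_FLAG", "ID_CHANGE_FLAG",
}

def clean_record(record):
    """Return a copy of the record without the irrelevant fields."""
    return {k: v for k, v in record.items() if k not in DROPPED_FIELDS}

row = {"Nationality": "702", "AGE": "34", "OCCUP_CD": "", "LINE_TENURE": "120"}
print(clean_record(row))  # {'AGE': '34', 'LINE_TENURE': '120'}
```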
The input to SVM Light must be in the following format:

    <line>    = <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
    <target>  = +1 | -1 | 0 | <float>
    <feature> = <integer>
    <value>   = <float>
    <info>    = <string>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs must be given in increasing order of feature numbers. Features with value zero can be skipped. Example of a training example:

    +1 1:0.23 3:0.56 8:1

b) Transforming program: A program was written to allocate to each field a representative number (the feature number) and to use the value of the data field as the feature value. +1 is the target value for 3G customers and -1 is the target value for 2G customers.

* Feature values for non-numeric data fields: For some of the data fields, we cannot assign a value directly because their values are not numeric, for example age, gender, and marital status. As these types of data are too important to ignore, each possible value of such a field was given its own feature number, and its feature value was set to 0.5. For example, one 3G customer can be transformed to:

    +1 1:0.5 4:0.5 15:0.5 20:0.5 26:0.5 32:0.5 39:0.359551 40:0.340561 41:0.484932 42:0.000758 43:0.003082 ...

Feature number 1 represents gender Male, 4 represents marital_status Single, and so on. For numeric data fields, we simply assign the field's value to its feature number, scaled into [0, 1]. The following are some of the feature numbers that the author's program assigned automatically to the data fields:

    1: Male Gender        40: LINE_TENURE
    2: Female Gender      41: DAYSTO_CONTRACT_EXPIRY
    3: Married            42: NUM_TEL
    4: Single             43: NUM_ACT_TEL
    5: Divorced           44: NUM_SUSP_TEL

3. Building Model:

SVM Learn of SVM Light was used to build the model for classification. As indicated above, it is better to misclassify a 2G customer as a 3G customer than to misclassify a 3G customer as a 2G customer.
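The transforming program can be sketched as below. The feature-number assignments for the indicator fields follow the table in the write-up; the numeric-field scale maxima and the sample record are assumptions made for illustration:

```python
# Sketch of the record-to-SVM-Light transformation described above:
# indicator features (gender, marital status) get value 0.5, numeric
# features are scaled into [0, 1] by a per-field maximum (the maxima
# here are illustrative assumptions), and pairs are emitted in
# increasing feature-number order as SVM Light requires.

INDICATOR_FEATURES = {("gender", "M"): 1, ("gender", "F"): 2,
                      ("marital", "Married"): 3, ("marital", "Single"): 4,
                      ("marital", "Divorced"): 5}
NUMERIC_FEATURES = {"LINE_TENURE": (40, 365.0),   # (feature no., scale max)
                    "DAYSTO_CONTRACT_EXPIRY": (41, 730.0)}

def to_svmlight(record, is_3g):
    pairs = []
    for (field, value), feat in INDICATOR_FEATURES.items():
        if record.get(field) == value:
            pairs.append((feat, 0.5))
    for field, (feat, scale) in NUMERIC_FEATURES.items():
        if field in record:
            pairs.append((feat, float(record[field]) / scale))
    pairs.sort()  # increasing feature numbers
    target = "+1" if is_3g else "-1"
    return target + " " + " ".join("%d:%g" % (f, v) for f, v in pairs)

rec = {"gender": "M", "marital": "Single", "LINE_TENURE": "73"}
print(to_svmlight(rec, True))  # +1 1:0.5 4:0.5 40:0.2
```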
In other words, since training errors on 3G positive examples should outweigh training errors on 2G negative examples, the j parameter (cost factor) of the SVM Light learning module was set higher than 1. As we already know that there will be 1,000 3G customers in the holdout dataset, the author decided to use all 18,000 labeled records as training data for the learning model. These 18,000 customer records were transformed into the numeric form above and passed to the SVM Light learning module.

4. Prediction:

The SVM Classify module of SVM Light was used to predict the new examples. There are two parameters that we can tune here in order to obtain the best-predicted 1,000 3G customers:

a) The feature value assigned to non-numeric indicator features (gender, marital status).
b) The j parameter (cost factor) of SVM Light.

After experimenting with these two parameters, the author set the indicator feature value to 0.4 and the j parameter to 1.5. With these values, the classification produced 1,086 predicted 3G customers and 4,914 predicted 2G customers.

Discussion:

The following are some of the decision values returned by the SVM Light classification module:

    1)  -1.2406089
    2)  -0.28477878   (potential)
    3)  -1.5060007
    4)  -1.6569823
    5)   1.2099758
    6)   0.58005892
    7)  -0.4054586
    8)  -0.2487679    (potential)
    9)   0.9694503
    10) -0.88280532
    11) -0.3529557
    12)  0.42899052
    13) -1.0195925
    14) -1.1603308
    15)  1.232341

Customers with values higher than 0 are predicted to be 3G customers; the others are predicted to be 2G customers. From this model, we can also easily identify which customers have the potential to switch to the 3G network: those whose values are only slightly less than 0 (e.g. customers 2 and 8, marked "potential" above).
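The post-classification analysis can be sketched as follows. SVM Classify writes one decision value per line; here we label each customer by the sign of its value and flag near-boundary negatives as potential switchers. The 0.3 cutoff for "slightly less than 0" is an assumption chosen to match the discussion above, not a value from the write-up:

```python
# Sketch of the discussion step: label customers by the sign of the
# SVM Light decision value and flag "potential" 3G switchers whose
# value is only slightly below 0 (the 0.3 cutoff is an assumption).

def analyse(decision_values, near_zero=0.3):
    predicted_3g, potential = [], []
    for i, v in enumerate(decision_values, start=1):
        if v > 0:
            predicted_3g.append(i)     # predicted 3G customer
        elif v > -near_zero:
            potential.append(i)        # 2G, but close to the boundary
    return predicted_3g, potential

values = [-1.2406089, -0.28477878, -1.5060007, -1.6569823, 1.2099758,
          0.58005892, -0.4054586, -0.2487679, 0.9694503, -0.88280532,
          -0.3529557, 0.42899052, -1.0195925, -1.1603308, 1.232341]
print(analyse(values))  # ([5, 6, 9, 12, 15], [2, 8])
```

On the fifteen values listed above, this recovers exactly the customers the discussion singles out: 2 and 8 as potential switchers.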