PAKDD 2006 Data Mining Competition
Write-Up
Participant Name: Nguyen Hoang Anh
Problem Summary
An Asian Telco operator which has successfully launched a third generation (3G) mobile
telecommunications network would like to make use of existing customer usage and
demographic data to identify which customers are likely to switch to using their 3G
network.
An original sample dataset of 20,000 2G network customers and 4,000 3G network
customers has been provided with more than 200 data fields. The target categorical
variable is “Customer_Type” (2G/3G). A 3G customer is defined as a customer who has
a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network
compatible mobile phone.
Three-quarters of the dataset (15K 2G, 3K 3G) will have the target field available and is
meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) will be
made available with the target field missing and is meant to be used for prediction.
The data mining task is a classification problem for which the objective is to accurately
predict as many current 3G customers as possible (i.e. true positives) from the “holdout”
sample provided.
Understanding of the problem:
As stated in the problem summary, the data mining task is a classification problem whose objective is to accurately predict as many current 3G customers as possible.
This classification or prediction will be carried out by a model built from the 18,000 customer records that have already been labelled.
The problem becomes easier once we know that the prediction data contain 5,000 2G customers and 1,000 3G customers. We can therefore tune the algorithm's settings so that it returns the best-predicted 1,000 3G customers.
In real life, this prediction task can be used for marketing purposes. If the company knows which customers are likely to switch to 3G, it can aim better marketing strategies at those targeted customers. That is why it is better to misclassify a 2G customer as a 3G customer than to misclassify a 3G customer as a 2G customer.
Approaching the problem:
Support Vector Machines (SVMs) were used for this classification task. The following is a general outline of the algorithm.
1. Introduction to SVMs:
Support Vector Machines were developed by Vapnik in 1995 based on the
Structural Risk Minimization principle from statistical learning theory.
The idea of structural risk minimization is to find a hypothesis h from a hypothesis space H for which one can guarantee the lowest probability of error Err(h) on a given set of training examples S:
(x1, y1), ..., (xn, yn),  xi ∈ R^N,  yi ∈ {-1, +1}
For simplicity, let us assume that the training data can be plotted in a plane and can be separated by at least one hyperplane h'.
[Figure: two linearly separable classes with the Optimal Hyperplane; the margin δ is the distance from the hyperplane to the closest examples, the support vectors.]
This means that there is a weight vector w' and a threshold b' such that all positive training examples lie on one side of the hyperplane while all negative training examples lie on the other side. This is equivalent to requiring yi[w'T.xi + b'] > 0 for each training example (xi, yi). In other words, the equation of a hyperplane that performs this separation is:
wT.x + b = 0
so that
wT.xi + b > 0 for yi = +1
wT.xi + b < 0 for yi = -1
In general, there can be multiple hyperplanes that separate the training data without error. From these, the Support Vector Machine chooses the Optimal Hyperplane, the one with the largest margin δ.
This particular hyperplane h(x) is the one shown in the figure. The margin δ is the distance from the hyperplane to the closest training examples. For each training set, there is only one hyperplane with maximum margin. The examples closest to the hyperplane are called Support Vectors; they lie at a distance of exactly δ.
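Formally, under the linear-separability assumption above, finding this maximum-margin hyperplane can be written as the standard hard-margin SVM optimization problem:

```latex
\min_{w,\,b} \ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{T} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

With the constraints normalized this way, the geometric margin is δ = 1/‖w‖, so minimizing ‖w‖² maximizes the margin.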
2. SVM Light:
Although there are many implementations of Support Vector Machines available, SVM Light, an implementation of SVMs in C, seems to be the most popular thanks to its high precision. SVM Light was used as the basic binary classifier for this classification task.
SVM Light can be downloaded from here: http://svmlight.joachims.org/
Full technical details of algorithm(s) used
The training and testing data were provided in an Excel sheet with more than 250 data fields. Each of these fields was represented by a feature number, and the feature value represented the value of the field.
1. Data cleaning and relevance analysis:
Data cleaning refers to preprocessing the data in order to remove or reduce noise. As not all of the 250 fields are useful or relevant, removing some of them helps reduce the number of dimensions the SVM must handle. After inspecting the data, the following fields were removed because they were unlikely to have much effect:
a) Nationality: most customers are from the same country (702).
b) OCCUP_CD: most of the values are missing.
c) SubPlan_Previous: the author decided to remove this field because a SubPlan_Change_Flag already exists and most customers do not change their plan.
d) NUM_DELINQ_TEL, PAY_METD, PAY_METD_PREV, PAY_METD_CHG, and HS_CHANGE: the data are not useful or relevant.
e) HS_MANUFACTURER: removed because a handset model field already exists.
f) BLACK_LIST_FLAG, TELE_CHANGE_FLAG, and ID_CHANGE_FLAG: the values are nearly constant across all records.
2. Transforming:
a) Input to SVMs:
As the input to SVMs must be in numeric form, all of the data needed to be transformed. Each data field was represented by a feature number, and the feature value represented the value of the field.
The input to SVM Light must be in the following format:

<line>    = <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target>  = +1 | -1 | 0 | <float>
<feature> = <integer>
<value>   = <float>
<info>    = <string>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs must be given in increasing order of feature number. Features with value zero can be skipped.
Example of a training line: +1 1:0.23 3:0.56 8:1
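The transformation into this format can be sketched as follows (a minimal illustration in Python; the language of the author's actual program is not stated):

```python
def to_svmlight_line(target, features):
    """Format one training example as an SVM Light input line.

    target   -- +1 for a 3G customer, -1 for a 2G customer
    features -- dict mapping feature number -> feature value
    """
    # Feature/value pairs must appear in increasing feature-number
    # order, and zero-valued features may simply be omitted.
    pairs = " ".join(f"{f}:{v}" for f, v in sorted(features.items()) if v != 0)
    return f"{target:+d} {pairs}"

print(to_svmlight_line(+1, {3: 0.56, 1: 0.23, 8: 1}))
# → +1 1:0.23 3:0.56 8:1
```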
b) Transforming program:
A program was written to allocate each field a representative number (its feature number) and to use the value of the data field as the feature value. +1 was used as the target value for 3G customers and -1 as the target value for 2G customers.
* Feature value of unmeasured data fields:
For some data fields we cannot assign a value number directly because their values are not numeric, for example, age, gender, marital status… As these types of data are too important to ignore, each such field value was given its own feature number, and its feature value was set to 0.5.
For example, one 3G customer can be transformed to:
+1 1:0.5 4:0.5 15:0.5 20:0.5 26:0.5 32:0.5 39:0.359551 40:0.340561
41:0.484932 42:0.000758 43:0.003082...
Feature number 1 represents gender Male, 4 represents marital_status Single…
For measured (numeric) data fields, we simply assign the value to the corresponding feature number and scale it into the range [0, 1].
The following are some of the feature numbers that the program automatically assigned to the data fields:
 1: Male Gender
 2: Female Gender
 3: Married
 4: Single
 5: Divorced
 40: LINE_TENURE
 41: DAYSTO_CONTRACT_EXPIRY
 42: NUM_TEL
 43: NUM_ACT_TEL
 44: NUM_SUSP_TEL
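The encoding described in this section can be sketched as below. This is a hypothetical reconstruction: the exact field names, the feature-number table beyond the examples above, and min/max scaling for numeric fields are assumptions, not the author's published code.

```python
# Assumed feature-number assignments, following the examples above.
CATEGORY_FEATURES = {
    ("GENDER", "M"): 1, ("GENDER", "F"): 2,
    ("MARITAL_STATUS", "Married"): 3,
    ("MARITAL_STATUS", "Single"): 4,
    ("MARITAL_STATUS", "Divorced"): 5,
}
NUMERIC_FEATURES = {"LINE_TENURE": 40, "DAYSTO_CONTRACT_EXPIRY": 41}

CATEGORY_VALUE = 0.5  # fixed feature value for categorical ("unmeasured") fields

def encode(record, numeric_ranges):
    """Map one raw customer record to a {feature number: value} dict.

    record         -- dict of field name -> raw value
    numeric_ranges -- dict of field name -> (min, max) over the dataset,
                      assumed here to be how numeric fields are scaled to 1
    """
    features = {}
    for field, value in record.items():
        if (field, value) in CATEGORY_FEATURES:
            # Categorical fields get the fixed value 0.5.
            features[CATEGORY_FEATURES[(field, value)]] = CATEGORY_VALUE
        elif field in NUMERIC_FEATURES:
            # Numeric fields are scaled into [0, 1].
            lo, hi = numeric_ranges[field]
            features[NUMERIC_FEATURES[field]] = (value - lo) / (hi - lo)
    return features

print(encode({"GENDER": "M", "MARITAL_STATUS": "Single", "LINE_TENURE": 10},
             {"LINE_TENURE": (0, 100)}))
# → {1: 0.5, 4: 0.5, 40: 0.1}
```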
3. Building Model:
SVM Learn of SVM Light was used to build the model for classification. As indicated above, it is better to misclassify a 2G customer as a 3G customer than to misclassify a 3G customer as a 2G customer. In other words, training errors on positive (3G) examples should outweigh training errors on negative (2G) examples, so the j parameter (cost factor) of the SVM Light learning module was set higher than 1.
Since we already know that there will be 1,000 3G customers in the holdout dataset, the author decided to use all 18,000 labelled customer records as training data for the learning model. These 18,000 records were therefore transformed into the numeric form described above and passed to the SVM Light learning module.
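The two SVM Light steps can then be run from the command line roughly as follows (the file names are placeholders; -j is SVM Light's documented cost-factor option):

```shell
# Train on the 18,000 labelled customers; -j > 1 makes training errors
# on positive (3G) examples cost more than errors on negative (2G) ones.
svm_learn -j 1.5 train.dat model

# Classify the 6,000 holdout customers; one decision value per line
# is written to predictions.dat.
svm_classify holdout.dat model predictions.dat
```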
4. Prediction:
The SVM Classify module of SVM Light was used to predict the new examples. Two parameters can be tuned here in order to best predict the 1,000 3G customers:
a) the feature value of unmeasured features (gender, marital status);
b) the j parameter (cost factor) of SVM Light.
After experimenting with the values of these two parameters, the author set the unmeasured feature value to 0.4 and the j parameter to 1.5.
With these values, the classification yielded 1,086 predicted 3G customers and 4,914 predicted 2G customers.
Discussion:
The following are some of the decision values returned by the SVM Light classification module.
1) -1.2406089
2) -0.28477878
3) -1.5060007
4) -1.6569823
5) 1.2099758
6) 0.58005892
7) -0.4054586
8) -0.2487679
9) 0.9694503
10) -0.88280532
11) -0.3529557
12) 0.42899052
13) -1.0195925
14) -1.1603308
15) 1.232341
Values higher than 0 indicate 3G customers; otherwise the customer is classified as 2G. From this model, we can also easily predict which customers have the potential to switch to the 3G network: these are the customers whose values are slightly less than 0 (e.g. customers 2 and 8).
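The selection of potential switchers described here can be sketched as follows (an illustrative reconstruction, not the author's actual code; the cutoff of 0.3 for "slightly less than 0" is an assumption):

```python
def split_by_decision_value(values, near_boundary=0.3):
    """Given SVM Light decision values (one per customer, in order),
    return (predicted 3G customers, potential switchers).

    Customers are numbered from 1. A value above 0 means predicted 3G;
    a 2G customer whose value is only slightly below 0 is flagged as a
    potential switcher to the 3G network.
    """
    predicted_3g = [i for i, v in enumerate(values, 1) if v > 0]
    potential = [i for i, v in enumerate(values, 1)
                 if -near_boundary < v <= 0]
    return predicted_3g, potential

# The first ten decision values listed above.
values = [-1.2406089, -0.28477878, -1.5060007, -1.6569823, 1.2099758,
          0.58005892, -0.4054586, -0.2487679, 0.9694503, -0.88280532]
threeg, potential = split_by_decision_value(values)
print(threeg)     # → [5, 6, 9]
print(potential)  # → [2, 8]
```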