The Problem:
This is a classification problem in the telecom domain. The crux of the problem is to classify
customers, based on various attributes, into two categories, 2G and 3G, where the categories
identify the type of network: 2G for second-generation and 3G for third-generation mobile
telecommunication networks. The primary objective of this exercise is to find a common pattern
of behavior among the customers of each network type, which will be helpful in targeting
customers for conversion from 2G to 3G. The training set provided has customer details
categorized into the above two classes.
The primary challenge of this task is that the feature set is very large, consisting of about
300 features. Besides, the domain is little known, at least to us.
Our Approach
As mentioned earlier, this domain is completely new to us. Treating the task as a
classification problem, we could consider an array of model-based techniques such as decision
trees, neural networks, and probabilistic approaches. Instead of these, we opted for a
case-based reasoning (CBR) approach. CBR follows the k-NN learning paradigm, in which domain
knowledge is encoded in the form of a similarity measure. To classify a new instance, we find
the k closest instances in the training set and predict its class from those nearest neighbors
(a minimal illustration of this rule is sketched below). The technique we used for the
PAKDD 2006 classification problem is described in the following sections.
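A minimal, generic sketch of this basic rule, in which the majority class among the k closest
training instances is predicted, might look as follows; the function name and data layout are
illustrative assumptions only, and the distance-weighted variant we actually used is described
under the classification phase.

    from collections import Counter

    def knn_classify(query, training_set, k):
        """training_set: list of (feature_vector, class_label) pairs."""
        # Sort training instances by squared Euclidean distance to the query.
        by_distance = sorted(
            training_set,
            key=lambda item: sum((q - x) ** 2 for q, x in zip(query, item[0])),
        )
        # Majority vote over the k closest neighbors.
        top_k_labels = [label for _, label in by_distance[:k]]
        return Counter(top_k_labels).most_common(1)[0][0]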
There are two phases in our technique. In the first phase, we studied the input training set
and performed attribute subset selection. In the second phase, we used the k-NN algorithm for
classification. We must admit that lack of domain knowledge forced us to choose parameters on
a purely empirical basis.
Feature Selection Phase:
The given input training set has around 250 attributes. Not all of these attributes
necessarily contribute to the classification process, and building a classifier over all of
them makes the classification technique computationally expensive. It is therefore necessary
to perform attribute subset selection and identify the attributes that are significant for
classification. Attribute subset selection is done by studying the input training set, and
various techniques for it have been proposed in the literature. We used an
information-gain-based mechanism to identify the attribute subset.
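This report does not spell out how the information-gain ranking was computed. A minimal
sketch, assuming categorical (or pre-discretized) attribute values, is given below; the
function names and the cut-off top_m are assumptions made purely for illustration.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels (e.g. '2G' / '3G')."""
        total = len(labels)
        counts = Counter(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(rows, labels, attr_index):
        """Gain from splitting the training set on one attribute."""
        base = entropy(labels)
        # Group the class labels by the value the attribute takes in each row.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr_index], []).append(label)
        remainder = sum((len(g) / len(labels)) * entropy(g) for g in groups.values())
        return base - remainder

    def select_attributes(rows, labels, num_attributes, top_m):
        """Indices of the top_m attributes ranked by information gain."""
        gains = [(information_gain(rows, labels, i), i) for i in range(num_attributes)]
        gains.sort(reverse=True)
        return [i for _, i in gains[:top_m]]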
Classification Phase:
We implemented the distance-weighted k-NN algorithm. Only the attributes identified in the
first phase were considered; the other attributes were ignored. We used the simple Euclidean
distance as the distance function between two instances. Where an attribute value was absent,
the average value of that attribute in the training set was used instead. We chose an
arbitrary value of 10 for k. For each instance in the prediction set, we obtained its k
nearest neighbors (the retrieved set) from the training set along with the distance
information. The prediction was then made by weighting the class of each instance in the
retrieved set inversely to its distance; the class with the higher total weight is returned
as the prediction (a minimal sketch of this procedure is given below).
The problem is very interesting and requires detailed analysis. Due to lack of time we have
implemented only a simple technique, and its prediction accuracy needs to be compared with
that of other classification techniques. A comparative study of different classification
techniques applied to this problem is required to identify the best strategy.
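A minimal sketch of this classification phase, assuming numeric attributes, missing values
represented as None, and k = 10 as stated above, might look as follows; the function and
variable names are illustrative assumptions rather than the original implementation.

    import math

    def euclidean(a, b):
        """Euclidean distance between two numeric attribute vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def column_means(rows):
        """Per-attribute mean over the values that are present (not None)."""
        means = []
        for j in range(len(rows[0])):
            present = [r[j] for r in rows if r[j] is not None]
            means.append(sum(present) / len(present) if present else 0.0)
        return means

    def fill_missing(row, means):
        """Replace missing attribute values with the training-set mean."""
        return [v if v is not None else means[j] for j, v in enumerate(row)]

    def predict(query, train_rows, train_labels, k=10):
        """Distance-weighted k-NN: each of the k neighbors votes with weight 1/distance."""
        neighbors = sorted(
            ((euclidean(query, row), label) for row, label in zip(train_rows, train_labels)),
            key=lambda pair: pair[0],
        )[:k]
        votes = {}
        for dist, label in neighbors:
            # Guard against a zero distance dominating the vote.
            votes[label] = votes.get(label, 0.0) + 1.0 / max(dist, 1e-9)
        return max(votes, key=votes.get)

    # Example usage (hypothetical variable names): impute the training set once,
    # then classify each prediction instance after filling its missing values.
    # means = column_means(train_rows)
    # train_rows = [fill_missing(r, means) for r in train_rows]
    # label = predict(fill_missing(new_instance, means), train_rows, train_labels, k=10)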