
Summary for PAKDD competition
Submitted by
Data Mining Group
Software and Computing Division, IHPC
Understanding of the problem:
This classification problem provides a data set with 249 attributes. A total of 18,000 samples
with class labels are used for training and validation, and 6,000 samples are available for
evaluating classification performance. The task is to distinguish 2G customers from 3G
customers; the resulting classification model will be used to predict potential 3G customers.
The attributes describe existing customers and include mobile usage and demographic data.
Full technical details of algorithm(s) used:
a) Attribute ranking is performed before training the classifiers. A Chi-squared Ranking
Filter is used to rank the attributes by importance.
b) Support vector machines are trained as classifiers on the top 10 attributes.
c) The linguistic rule extraction algorithm is used to extract rules describing the
classification decisions.
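
Step a) can be illustrated with a minimal sketch of chi-squared attribute ranking. This is not the authors' implementation: it assumes the attributes are already discrete (continuous attributes would first need to be binned), and the attribute names and toy data below are hypothetical.

```python
from collections import Counter

def chi_squared(values, labels):
    """Pearson chi-squared statistic between one discrete attribute and the class label."""
    n = len(values)
    value_counts = Counter(values)
    label_counts = Counter(labels)
    observed = Counter(zip(values, labels))
    stat = 0.0
    for v, v_count in value_counts.items():
        for c, c_count in label_counts.items():
            # Count expected in this cell if attribute and class were independent.
            expected = v_count * c_count / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

def rank_attributes(columns, labels, top_k=10):
    """Return attribute names sorted by decreasing chi-squared score, cut to top_k."""
    scores = {name: chi_squared(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data, made up for illustration (not taken from the competition set).
labels = ["3G", "3G", "3G", "2G", "2G", "2G"]
columns = {
    "handset_age": ["new", "new", "new", "old", "old", "old"],  # separates the classes
    "gender": ["M", "F", "M", "F", "M", "F"],                   # weakly related to class
}
ranking = rank_attributes(columns, labels, top_k=2)
print(ranking)
```

An attribute that is strongly associated with the class label receives a high score and is placed near the top of the ranking; the classifiers in step b) would then be trained on the first 10 names returned.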
The classification model produced:
Five SVM classifiers are generated, and the votes of the five classifiers are combined into
the final decision when classifying the unlabelled data.
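
The voting step can be sketched as follows. The summary does not state the exact voting scheme, so a simple majority vote is assumed here, and the five threshold functions are hypothetical stand-ins for the trained SVMs (the attribute name `data_usage` is also made up).

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Collect one vote per classifier and return the majority label."""
    votes = [clf(sample) for clf in classifiers]
    return majority_vote(votes)

def make_threshold_classifier(threshold):
    """Hypothetical stand-in for a trained SVM: thresholds a single attribute."""
    def classify(sample):
        return "3G" if sample["data_usage"] > threshold else "2G"
    return classify

# Five stand-in classifiers; with two classes and five voters a tie cannot occur.
classifiers = [make_threshold_classifier(t) for t in (10, 20, 30, 40, 50)]

print(ensemble_predict(classifiers, {"data_usage": 35}))
print(ensemble_predict(classifiers, {"data_usage": 25}))
```

Using an odd number of classifiers for a two-class problem guarantees that the vote always produces a decision.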
Insights obtained from the classification model:
The subset composed of the top 10 attributes gives better prediction results than the
other attribute subsets (600 3G and 600 2G samples are used for validation).
A total of 103 linguistic rules are obtained, each composed of 10 premises in “IF… THEN…” form.
The rules give 60% prediction accuracy on our validation data.
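
How "IF… THEN…" rules of this kind can be applied and scored on a validation set is sketched below. The rule representation, the default class, the attribute names, and the data are all hypothetical; the toy rules use two premises each rather than the ten reported above, and the accuracy computed here is unrelated to the 60% figure.

```python
def rule_matches(rule, sample):
    """A rule fires only when every premise (attribute is term) holds for the sample."""
    return all(sample.get(attr) == term for attr, term in rule["if"].items())

def predict(rules, sample, default="2G"):
    """Return the conclusion of the first matching rule, or a default class."""
    for rule in rules:
        if rule_matches(rule, sample):
            return rule["then"]
    return default

def accuracy(rules, samples, labels):
    """Fraction of samples whose predicted class equals the true label."""
    correct = sum(predict(rules, s) == y for s, y in zip(samples, labels))
    return correct / len(samples)

# Hypothetical rules: IF data_usage is high AND handset_age is new THEN 3G, and so on.
rules = [
    {"if": {"data_usage": "high", "handset_age": "new"}, "then": "3G"},
    {"if": {"data_usage": "low", "handset_age": "old"}, "then": "2G"},
]

# Toy validation set with attributes already mapped to linguistic terms.
samples = [
    {"data_usage": "high", "handset_age": "new"},
    {"data_usage": "low", "handset_age": "old"},
    {"data_usage": "high", "handset_age": "old"},
    {"data_usage": "low", "handset_age": "new"},
]
labels = ["3G", "2G", "3G", "2G"]

validation_accuracy = accuracy(rules, samples, labels)
print(validation_accuracy)
```

Samples that no rule covers fall back to the default class, which is one reason a rule set can be less accurate than the classifiers it was extracted from.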