International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 3, March 2013
ISSN 2319 - 4847
Attribute Grouping Based Classification for
Knowledge Discovery in Databases
Nikhat Khan1, Dr. Fozia H. Khan2 and Dr. G.S. Thakur3
1,2,3 Maulana Azad National Institute of Technology, Bhopal
ABSTRACT
The huge number of marketing campaigns has diminished their effect on the general public. Technologies such as data warehousing,
data mining and campaign management software have opened new ways for firms to gain competitive advantage.
Through data mining, firms can identify valuable customers, which in turn enables them to make practical, knowledge-driven
decisions. Many approaches abound in the field of data mining. In this paper we propose to group
the data as a pre-processing step before applying classification methods. We test this method on several classifiers to
assess its performance. Classification performance and interpretability are of major importance, and the intention is
to keep the resulting rule bases small and comprehensible. The primary model is obtained from the data, and
attribute grouping is subsequently applied to improve the performance and reduce the cost of the model.
Keywords: Attribute Grouping, Classification, Data Mining, Neural Networks
1. INTRODUCTION
Direct marketing campaigns conducted over a period of time produce data and information that managers analyse to
support decision making. This analysis becomes difficult when the data is very large, and resolving this
problem led to the development of business intelligence. The global economic crisis also encouraged new
ideas about financial management and new points of view on how such data can be used to support management
decisions.
This paper starts with a brief review of the background concepts of business intelligence, followed by
the methodological approach and a description of the available data. The classification techniques are then applied
and evaluated, the discovered knowledge is studied and analysed in the results section, and finally we draw our
conclusions.
Organizations and enterprises try to promote services and/or products through mass campaigns or by targeting a
specific set of contacts [1]. In the present scenario of massive competition, positive responses to mass campaigns
are comparatively low. Focusing on specific targets through directed marketing campaigns is therefore
becoming more popular because of its efficiency. In a study by Page and Luding it is observed that directed marketing
has some drawbacks, as it may give rise to a negative attitude towards banks when customers feel that it intrudes on
their privacy [2].
Data mining is the process in which huge amounts of data are collected and relevant information is derived from them.
Frawley, Piatetsky-Shapiro, and Matheus (1992) described data mining as "the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data" [3]. Data mining, the extraction of hidden
predictive information from large databases, is a powerful technology with great potential to help
companies focus on the valuable information in their data warehouses. Data mining tools are used for predicting
future trends, allowing businesses to make knowledge-driven decisions, and can quickly answer business
questions that were previously too time-consuming to resolve.
2. METHODOLOGY
In our experiments we decided to use various classification methods such as decision trees and neural networks. The
source data relates to direct marketing campaigns of a Portuguese banking institution. The marketing
campaigns were based on phone calls made to determine whether a bank term deposit would be subscribed.
The data was obtained from http://archive.ics.uci.edu/ml/datasets/Bank+Marketing [4] [5].
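As an illustration of how this data can be loaded for the experiments, the following short Python sketch reads the UCI file with pandas; the file name bank-full.csv and the semicolon separator reflect the usual UCI distribution and are assumptions rather than details stated in this paper.

# Minimal sketch: load the UCI Bank Marketing data (assumed file name and separator).
import pandas as pd

bank = pd.read_csv("bank-full.csv", sep=";")
print(bank.shape)                  # number of records and attributes
print(bank["y"].value_counts())    # target attribute: term deposit subscribed ("yes"/"no")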
Picture 1 describes the standard process for knowledge discovery in databases (data mining). It begins with selection,
followed by pre-processing of the data, then data transformation, the data mining task itself, and finally
evaluation and application of the system.
Picture 1: Knowledge Discovery in Databases
Source Data: It contains the following attributes:
1 - Age
2 - Job: type of job
3 - Marital: marital status
4 - Education
5 - Default: has credit in default?
6 - Balance: average yearly balance, in Euros
7 - Housing: has housing loan?
8 - Loan: has personal loan?
9 - Contact: contact communication type
10 - Day: last contact day of the month
11 - Month: last contact month of year
12 - Duration: last contact duration, in seconds
13 - Campaign: number of contacts performed during this campaign and for this client
14 - Pdays: number of days that passed by after the client was last contacted from a previous campaign
15 - Previous: number of contacts performed before this campaign and for this client
16 - Poutcome: outcome of the previous marketing campaign
17 - Subscription
[1] Decision Trees
Decision trees classify instances by sorting them based on feature values. Each node of a decision tree represents a
feature of the instance to be classified [6], and each branch represents a value that the node can assume. Instances
are classified beginning at the root node and then sorted according to their feature values. The major approaches available
are breadth-first and depth-first greedy searches. There are different types of decision trees; those whose
performance we observed in our experiments are briefly described here:
AD Tree: An alternating decision tree has decision nodes and prediction nodes [7]. Decision nodes
specify a predicate condition, while prediction nodes contain a single number. AD trees always have prediction
nodes as both root and leaves. An instance is classified by an AD tree by following all paths for which all decision
nodes are true and summing the values of any prediction nodes that are traversed.
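To make the scoring rule concrete, the following Python sketch shows how an alternating decision tree sums prediction values along all satisfied paths; the node classes, thresholds and the two-node example tree are purely illustrative and are not taken from the trees learned in our experiments.

# Illustrative sketch of alternating decision tree scoring: prediction nodes carry a
# number, decision nodes carry a predicate, and the score of an instance is the sum
# of every prediction node reached along paths whose decision conditions hold.

class PredictionNode:
    def __init__(self, value, decision_children=()):
        self.value = value                        # contribution to the total score
        self.decision_children = decision_children

class DecisionNode:
    def __init__(self, predicate, if_true, if_false):
        self.predicate = predicate                # e.g. lambda x: x["duration"] > 180
        self.if_true = if_true                    # prediction node taken when the predicate holds
        self.if_false = if_false                  # prediction node taken otherwise

def adtree_score(node, instance):
    """Sum prediction values over all satisfied paths, starting at a prediction node."""
    total = node.value
    for d in node.decision_children:
        branch = d.if_true if d.predicate(instance) else d.if_false
        total += adtree_score(branch, instance)
    return total

# Hypothetical two-condition tree: a positive score means "subscribes", otherwise "no".
tree = PredictionNode(-0.4, [
    DecisionNode(lambda x: x["duration"] > 180,
                 PredictionNode(+0.9), PredictionNode(-0.2)),
    DecisionNode(lambda x: x["balance"] > 1500,
                 PredictionNode(+0.3), PredictionNode(-0.1)),
])
print("yes" if adtree_score(tree, {"duration": 300, "balance": 2000}) > 0 else "no")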
[2] Neural Networks
A neural network is another well-known representative of learning algorithms and is categorized into two
basic classes: unsupervised and supervised networks. Unsupervised networks are generally represented by competitive
layers, which automatically lead to finding relationships within the input vectors [8].
Since the processed data is labelled and we already have prior knowledge of the output classes, we conducted our
experiments on supervised networks, which provide better performance in such cases [9]. There are many types of
supervised networks; the most common is still the multilayer perceptron (MLP) trained with the back-propagation
learning algorithm.
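As a rough sketch of the supervised MLP setting on this data, the code below trains scikit-learn's MLPClassifier (a back-propagation-trained multilayer perceptron) on a one-hot encoded version of the bank data; scikit-learn, the file name, the hidden-layer size and the train/test split are assumptions and not the exact tool or settings used in our experiments.

# Sketch: multilayer perceptron with back-propagation on the bank marketing data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

bank = pd.read_csv("bank-full.csv", sep=";")           # assumed UCI file
X = pd.get_dummies(bank.drop(columns=["y"]))           # one-hot encode categorical attributes
y = (bank["y"] == "yes").astype(int)                   # target: term deposit subscribed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300)   # illustrative layer size
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))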
[3] Attribute-Grouping Approach
In this approach the data is grouped by converting the actual discrete values into linguistic terms, and the
resulting data is then subjected to alternating decision tree classification. This reduces the dimensionality of the data.
Although the boundaries of the discretized intervals are determined by IF-THEN conditions chosen to minimize
the uncertainty of the target attribute, the user may be more interested in linguistic descriptions of these intervals
rather than in their precise numeric boundaries. Thus, we start by expressing "linguistic ranges" of continuous attributes
as lists of terms that the attributes can take ("high", "low", etc.). In our approach the attribute values are converted into
range values, or groups, and then various classification techniques are applied. The data set is thus divided into a number
of groups based on these ranges.
The following IF-condition formulas are used to group the data (a Python equivalent is sketched after this list):
1. =IF(F2>1500, "Very High", IF(F2>1000, "High", IF(F2>500, "Medium", IF(F2>100, "Low", "Very Low"))))
a. This formula is used to group the balance.
b. If the balance is above 1500, the result is very high balance.
c. If the balance is between 1000 and 1500, high balance.
d. If the balance is between 500 and 1000, medium balance.
e. If the balance is between 100 and 500, low balance.
f. If the balance is below 100, very low balance.
2. =IF(J2<15, "Early", "Late")
a. If the contact is made in the first half of the month (day less than 15), then Early, otherwise Late.
3. =IF(60<A2, "Senior Citizen", IF(25<A2, "Middle Aged", "Youth"))
a. If the age is above 60, the result is Senior Citizen.
b. If the age is between 25 and 60, Middle Aged.
c. If the age is below 25, Youth.
4. =IF(L2>180, IF(360>L2, "Medium", "Long"), "Short")
a. Classifies the call length as Long, Medium or Short.
5. =IF(N2>0, IF(N2>720, "2 years", IF(N2>360, "1 year", "Short Time")), "Not contacted")
a. Classifies the time since the client was last contacted as 2 years, 1 year, a short time, or not contacted (the value has been rounded to 10).
6. =IF(O2=0, "Not Contacted", IF(O2>20, "Often", "Average"))
a. Classifies the number of times the client was contacted in previous campaigns; the classes are Not Contacted, Often and Average.
7. =IF(M2=1, "Once", IF(M2>10, "Very Frequent", IF(M2>5, "Often", "Average")))
a. Same as above but for the current campaign, with the classes Once, Very Frequent, Often and Average.
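The Python sketch below mirrors the spreadsheet formulas above, assuming the spreadsheet columns F, J, A, L, N, O and M correspond to the attributes balance, day, age, duration, pdays, previous and campaign of the UCI file; it is a rough equivalent, not the exact pre-processing script used here.

# Sketch: convert numeric attributes into the linguistic groups defined by the IF formulas.
import pandas as pd

def group_bank(df):
    g = pd.DataFrame(index=df.index)
    # 1. Balance -> Very Low / Low / Medium / High / Very High
    g["balance"] = pd.cut(df["balance"],
                          [-float("inf"), 100, 500, 1000, 1500, float("inf")],
                          labels=["Very Low", "Low", "Medium", "High", "Very High"])
    # 2. Day of last contact -> Early / Late
    g["day"] = df["day"].apply(lambda d: "Early" if d < 15 else "Late")
    # 3. Age -> Youth / Middle Aged / Senior Citizen
    g["age"] = pd.cut(df["age"], [-float("inf"), 25, 60, float("inf")],
                      labels=["Youth", "Middle Aged", "Senior Citizen"])
    # 4. Call duration (seconds) -> Short / Medium / Long
    g["duration"] = df["duration"].apply(
        lambda s: ("Medium" if s < 360 else "Long") if s > 180 else "Short")
    # 5. Days since the previous campaign contact -> Not contacted / Short Time / 1 year / 2 years
    g["pdays"] = df["pdays"].apply(
        lambda p: "Not contacted" if p <= 0
        else "2 years" if p > 720 else "1 year" if p > 360 else "Short Time")
    # 6. Contacts in previous campaigns -> Not Contacted / Often / Average
    g["previous"] = df["previous"].apply(
        lambda n: "Not Contacted" if n == 0 else "Often" if n > 20 else "Average")
    # 7. Contacts in the current campaign -> Once / Very Frequent / Often / Average
    g["campaign"] = df["campaign"].apply(
        lambda n: "Once" if n == 1 else "Very Frequent" if n > 10
        else "Often" if n > 5 else "Average")
    return g

# Example usage: grouped = group_bank(pd.read_csv("bank-full.csv", sep=";"))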
3. RESULTS
We selected several algorithms as a testing base on both the grouped and the non-grouped data. The algorithms
used were:
1. AD Tree
2. Decision Stump [10]
3. Random Forest
4. Random Tree
5. REP Tree
6. Multilayer Perceptron [11]
We first performed the classification task on the grouped data; the results obtained with all the classifiers are
shown in Table 1. Subsequently, we performed the mining on the ungrouped data and obtained the results shown
in Table 2.
Table 1: Results (Grouped)
Algorithm             | Classification Accuracy | Time Required [s] | ROC Area
AD Tree               | 89.7075%                | 4.25              | 0.863
Decision Stump        | 88.26%                  | 0.09              | 0.708
Random Forest         | 87.6948%                | 3.78              | 0.838
Random Tree           | 85.94%                  | 0.28              | 0.685
REP Tree              | 88.3559%                | 1.48              | 0.776
Multilayer Perceptron | 86.9747%                | 3486.7            | 0.839
Table 2: Results (Ungrouped)
Algorithm             | Classification Accuracy | Time Required [s] | ROC Area
AD Tree               | 88.7073%                | 29.91             | 0.865
Decision Stump        | 88.26%                  | 0.72              | 0.702
Random Forest         | 89.0563%                | 11.34             | 0.871
Random Tree           | 86.8027%                | 1.34              | 0.700
REP Tree              | 89.0243%                | 3.16              | 0.782
Multilayer Perceptron | 87.7415%                | 2187.72           | 0.833
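To indicate how figures of this kind (accuracy, run time and ROC area) can be produced, the sketch below times and scores a few scikit-learn classifiers on a held-out split; scikit-learn has no direct AD tree, REP tree or decision stump implementations, so the models here are stand-ins, and the split, file name and parameters are assumptions rather than the exact experimental setup.

# Sketch: compare classifiers on accuracy, training time and ROC area.
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

bank = pd.read_csv("bank-full.csv", sep=";")
X = pd.get_dummies(bank.drop(columns=["y"]))
y = (bank["y"] == "yes").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = [("Decision Tree", DecisionTreeClassifier()),
          ("Random Forest", RandomForestClassifier()),
          ("Multilayer Perceptron", MLPClassifier(max_iter=300))]

for name, model in models:
    start = time.time()
    model.fit(X_tr, y_tr)                         # measure training time
    elapsed = time.time() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    roc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: accuracy={acc:.4%}, time={elapsed:.2f} s, ROC area={roc:.3f}")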
4. DISCUSSION/ ANALYSIS
In our experiments we focused on methods that provide information about the exact classification rules.
We observed that the best classification of the data is generally obtained from decision trees, which also provide
classification rules in a format that is easy to visualize and understand. The experiments with decision trees can
therefore be considered highly successful; an easy-to-understand set of rules is obtained from the alternating decision
tree (AD Tree).
In terms of success, the attribute grouping method proved very effective, as it saved a great deal of time.
This is evident from the results. We discuss the results of the classification with the different classifiers
below.
4.1 AD Tree
The most successful classifier is perhaps the AD tree. It retained pinpoint accuracy while saving time, and it is
the only algorithm that showed improvement in both time and accuracy. Moreover, it proved more accurate when
mining the grouped data than the ungrouped data. The time spent on classifying the ungrouped data is almost seven
times the time spent on the grouped data. The result of classification using this algorithm is shown in
Picture 2.
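These ratios follow directly from Tables 1 and 2: the ungrouped-data time of 29.91 s against the grouped-data time of 4.25 s gives 29.91 / 4.25 ≈ 7.0, i.e. roughly a seven-fold speed-up, or equivalently a reduction of about 1 - 4.25/29.91 ≈ 86% in classification time.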
Picture 2: Alternating Decision Tree, Visualized Form
4.2 Decision Stump
The accuracy of this algorithm remained the same on both datasets. What changed was the time: the time
taken with the grouped data was nearly one-eighth of the time taken on the original data set.
4.3 Random Forest
This algorithm showed a decrease in accuracy when run on the grouped data, although it saved considerably
on time. The loss in accuracy is small compared with the saving in time, which matters especially when organizations
have large numbers of records and very little time to perform data mining.
4.4 Random Tree
This algorithm, like random forest, showed a decline in accuracy, but the decline is small compared with the time
saved.
4.5 REP Tree
This tree, like the previous two, saved time at the cost of a slight drop in classification accuracy and ROC area;
however, the drop is small compared with the difference in time.
Picture 3 (a): REP Tree Visualized Form, Complete
Picture 3 (b): REP Tree Visualized Form, In Part
4.6 Multilayer Perceptron
This was the only neural network, and the only classifier that is not a decision tree, with which we performed our
classification. It was also the only classifier that showed an increase in time when run on the grouped data,
along with a decrease in accuracy and ROC area.
Overall, this approach gave very good results, with a large improvement in the performance of the classifiers.
For the AD tree, the performance improvement after applying attribute grouping was almost 90% in terms of
classification time, while the classification accuracy did not differ much for most classifiers. This is a considerable
success for our experimental investigation.
5. CONCLUSION
From our experiments we conclude that using our attribute grouping approach during pre-processing is very
effective at improving the performance of most classifiers. This is especially true for organizations with a huge
number of records, limited processing time and tight budgets. The method cuts processing time considerably while
accuracy remains almost stable or shows only minor fluctuation.
Therefore it is very useful to perform classification using this approach. The most successful algorithm remained
the alternating decision tree, so this approach is best used together with the alternating decision
tree algorithm. The multilayer perceptron neural network, however, did not show improvement in either time
or accuracy, so it is not advisable to combine this approach with neural networks.
The amount of time spent in pre-processing is negligible compared with the amount of time saved by this technique.
This approach therefore provides a good way for organizations to save the time and cost
spent on processing information that is full of numeric data. Further work on this topic could include a comparison
of a bigger variety of algorithms.
References
[1] Ling, X. and Li, C., 1998. “Data Mining for Direct Marketing: Problems and Solutions”. In Proceedings of the 4th
KDD conference, AAAI Press, 73–79.
[2] Page, C. and Luding, Y., 2003. “Bank manager’s direct marketing dilemmas – customer’s attitudes and purchase
intention”. International Journal of Bank Marketing 21, No.3, 147–163.
[3] Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI
Magazine: Fall, 213–228.
[4] UCI Machine Learning Repository, "UCI Machine Learning Repository: Bank Marketing Data Set". Available:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
[5] [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An
Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation
and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
[6] Quinlan, J. R., (1986). Induction of Decision Trees. Machine Learning 1: 81-106, Kluwer Academic Publishers
[7] Yoav Freund and Llew Mason. The Alternating Decision Tree Algorithm. Proceedings of the 16th International
Conference on Machine Learning, pages 124-133 (1999)
[8] Haykin, S. O.: Neural Networks and Learning Machines, 3rd edition, Prentice Hall, ISBN 9780131471399, 2009
[9] Fejfar, J., Šťastný, J. Cepl, M. Time series classification using k-Nearest Neighbours, Multilayer Perceptron and
Learning Vector Quantization algorithms. In: Acta Universitatis Agriculturae et Silviculturae
[10] Iba, Wayne; and Langley, Pat (1992); Induction of One-Level Decision Trees, in ML92: Proceedings of the Ninth
International Conference on Machine Learning, Aberdeen, Scotland, 1–3 July 1992, San Francisco, CA: Morgan
Kaufmann, pp. 233–240
[11] Rosenblatt, Frank. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan
Books, Washington DC, 1961
AUTHOR
Mrs. Nikhat Khan is an IT professional with over 15 years of experience. She has worked in multiple national and
international organizations. She is currently working as a Delivery Project Manager and has visited the US, the UK and
Germany to manage and execute IT projects for renowned international clients. She is currently pursuing a
PhD at Maulana Azad National Institute of Technology, Bhopal (MANIT) and is keen to learn about advances
in business intelligence, data mining and fuzzy logic.
Dr. Fozia Haque Khan is Assistant Professor and coordinator of M.Tech (Nanotechnology) at MANIT Bhopal.
Author of a book on quantum mechanics, Dr. Fozia has guided 26 M.Tech theses and is currently guiding 7 PhDs on
highly applicable materials such as oxide nanomaterials for solar cells, gas sensors and other nanomaterials. With over
80 publications in national and international journals and conferences of repute, she has delivered talks at several
national and international conferences, and has visited King Abdul Aziz University, Jeddah, Saudi Arabia, Harvard
University, Boston, USA, and the Naval Research Lab, Washington DC, USA.
Dr. Ghanshyam Singh Thakur received a BSc degree from Dr. Hari Singh Gour University, Sagar, M.P. in 2000, an
MCA degree in 2003 from Pt. Ravi Shankar Shukal University, Raipur, C.G., and a PhD degree from
Barkatullah University, Bhopal, M.P. in 2009. He is Assistant Professor in the Department of Computer
Applications, Maulana Azad National Institute of Technology, Bhopal, M.P., India. He has eight years of teaching
and research experience and 26 publications in national and international journals. His research interests include text
mining, document clustering, information retrieval and data warehousing. He is a member of the CSI, IAENG, and
IACSIT.