Ascertaining Apropos Mining Algorithms for Business Applications

Misha Sheth
Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
misha.sheth@gmail.com

Parth Mehta
Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
parthpmehta93@gmail.com

Prof. Chetashri Bhadane
Asst. Prof., Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
chetashri.bhadane@djsce.ac.in
ABSTRACT
With the onset of the information technology era, there is an
increasing trend of enterprises attempting to collect and store
colossal amounts of data. This calls for efficient data mining
technologies to expedite data processing, information retrieval
and subsequent knowledge generation. Since the complexities of data mining are difficult to grasp, determining the optimal method for a given application becomes of prime importance. To resolve this problem, we
analyze several approaches vis-à-vis their methodology, type
of input parameters, speed of training, ease of modelling as
well as issues specific to each method. This allows for swift
and profitable applications of data mining mechanisms.
Further, leveraging the specific strengths and weaknesses of
these techniques in context of business, we look at two
applications of data mining in the financial area and attempt to
suggest an appropriate method for each of them.
Keywords
Data Mining, Knowledge Discovery, Classification Methods,
Business Decision
INTRODUCTION
With the advent of the Internet and the rising growth of business, there is a gradual realization that the huge amounts of data collected can be processed and analyzed to support strategic decision making. This data ought to be converted
into information via a sequential data gathering and mining
process. Once data is collected using various collection
methods, it is cleaned and processed to remove discrepancies.
Numerous data mining approaches can then be applied on this
data to yield desired outcomes. This ultimately culminates
into intelligent decision making. By knowledge discovery in
databases, useful knowledge, discrepancies, and important
information can be drawn out from the database for
investigation from various perspectives.
However, due to the diverse domains in which business applications are present, it is difficult to precisely identify the algorithm best suited to a specific category, requirement, and desired outcome. We adopt a literature survey to summarize the approaches and concepts involved.
Moreover, the selection of a technique requires both conceptual analysis and an operational definition of the business decision and its applications. Applications are usually composed of several problems to be solved, and in this review we study each application by breaking it into parts to establish a standard description.
The rest of the paper is organized as follows: we first present a review of previous work in this area, including an explanation of the various data mining techniques and of the two applications being considered, namely cross-selling and segmentation analysis. We then provide a comparative study of five data mining techniques: Naïve Bayes, Decision Tree, Neural Network, Support Vector Machine, and Logistic Regression. Finally, we evaluate the results and conclude the paper.
LITERATURE REVIEW
People often take data mining as a synonym for a popularly
used term, Knowledge Discovery in Databases, or KDD. It is
also correct to view data mining as simply an important and
crucial step in the process of knowledge discovery in
databases. Selecting a data mining algorithm includes choosing the method(s) to be used for finding patterns in the data, deciding which parameters and models may be appropriate, and matching a particular data mining method with the overall requirements of the KDD process. In the final step, the mining results that match the requirements are elucidated and organized so that they can be put into action or presented to interested parties. The concept of data mining encompasses all the activities and methods that use collected data to derive implicit information and evaluate historical records to gain valuable knowledge.
Naïve Bayes
Bayesian classifiers are statistical classifiers. Class
membership probabilities can be predicted using Bayesian
classifiers. They designate the most likely class to a specific
given example, as illustrated by its feature vector [3]. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where $X = (X_1, \ldots, X_n)$ is a feature vector and $C$ is a class [3]. Naïve Bayes (NB) probabilistic classifiers are the most commonly used [3]. The primary idea in this approach is to use the joint probabilities of words and categories to estimate the probabilities of categories for a specific scenario or document.
The "naïve" part of the Naïve Bayes technique is the assumption of word independence: the conditional probability of a word given a category is assumed to be independent of the conditional probabilities of other words given that category [5]. This assumption is a primary reason why Naïve Bayes classifiers can be computed far more efficiently than non-naïve Bayes techniques, whose complexity is exponential, since word combinations are not used as predictors [5].
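To make this concrete, the following is a minimal sketch of Naïve Bayes text categorization in Python; the toy documents, labels, and the choice of scikit-learn's MultinomialNB are illustrative assumptions of ours, not something the paper prescribes.

```python
# A minimal sketch of Naive Bayes text categorization (illustrative only):
# scikit-learn's MultinomialNB stands in for the classifier described above,
# and the documents/labels are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loan offer", "loan approval needed",
        "team meeting today", "project meeting notes"]
labels = ["finance", "finance", "work", "work"]

# Word counts per document; joint word/category counts drive P(word | category).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Category probabilities are computed under the word-independence assumption.
test = vectorizer.transform(["loan meeting"])
print(model.predict(test), model.predict_proba(test))
```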
Decision Tree
A decision tree (DT) is an extremely useful tool for classification. It is simple, easy to understand, and easy to analyze. Furthermore, building the classification model does not require much time. A DT has a flowchart-like
tree structure, where every internal node denotes a test on an
attribute, each branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class label [5]. The
node at the top of a tree is the root node. While constructing
the tree, attribute selection measures are employed to find the attribute that best partitions the tuples into distinct
classes. Information Gain, Gain Ratio, and Gini Index are
popular attribute selection measures. While constructing a DT
for the purpose of classification a crucial factor that needs to
be addressed is the degree of adjustment of the model to the
training set being used. If a tight stopping criterion is employed during construction, the result is a small, under-fitted DT. On the other hand, a loose stopping criterion generates a large DT that over-fits the data of the training set. Pruning methods were developed to solve this dilemma: using a loose stopping criterion, the DT is allowed to over-fit the training set, and the over-fitted tree is then cut back to a smaller tree by eliminating sub-branches that do not contribute to generalization accuracy. Such pruning leads to improved performance [7].
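As a sketch of this grow-then-prune strategy, the fragment below (assuming scikit-learn and its bundled iris data, both our own choices) grows a tree on the Gini index and then cuts it back with cost-complexity pruning, which here plays the role of the pruning methods described above.

```python
# A sketch of decision-tree induction with post-pruning (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Loose stopping criterion: grow the full tree, splitting on the Gini index.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)

# Cost-complexity pruning cuts back sub-branches that add little
# generalization accuracy (larger ccp_alpha prunes more aggressively).
pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02,
                                     random_state=0)
pruned_tree.fit(X_train, y_train)

print("full tree leaves:", full_tree.get_n_leaves(),
      "test accuracy:", full_tree.score(X_test, y_test))
print("pruned tree leaves:", pruned_tree.get_n_leaves(),
      "test accuracy:", pruned_tree.score(X_test, y_test))
```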
Support Vector Machine

A support vector machine (SVM) is an algorithm which uses a nonlinear mapping to transform the original training data into a higher dimension. In this new dimension, it searches for the linear optimal separating hyperplane, a "decision boundary" that separates the tuples of one class from another. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors) [5]. The essential idea behind the support vector machine is illustrated by the example shown in Figure 1, where the data are assumed to be linearly separable; thus, there exists a linear hyperplane which separates the points into two different classes. In a two-dimensional model, the hyperplane is a simple straight line. Figure 1 illustrates two such hyperplanes, B1 and B2; both can divide the training examples into their respective classes without committing any misclassification errors [5].

Figure 1: An example of a two-class problem with two separating hyperplanes.

Even though the training time of even the fastest SVMs may be extremely slow, they are highly accurate, particularly owing to their ability to model complex nonlinear decision boundaries. They are also much less prone to over-fitting than other methods [5].
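A minimal sketch of this idea, assuming scikit-learn's SVC and a synthetic two-class data set of our own making: the RBF kernel supplies the nonlinear mapping to a higher dimension, and the fitted model exposes the support vectors that define the margin.

```python
# A sketch of SVM classification on data that is not linearly separable
# in the original two dimensions (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular class boundary

# The RBF kernel implicitly maps the data to a higher dimension where a
# separating hyperplane exists.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors used:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```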
Neural Networks

A neural network is a mathematical or computational model based on biological neural networks; one can think of it as an emulation of biological neural mechanisms. A typical network consists of a set of input nodes that are connected to a set of output nodes through a set of hidden nodes [2]. It consists of a system of interconnected artificial neurons and evaluates information by employing a connectionist approach to computation. Often, it is an adaptive system that transforms its structure based on external or internal knowledge that progresses through the network during the learning phase [1]. In a feed-forward neural network, the information flows in only the forward direction, from the input nodes, through the hidden nodes (if any), to the output nodes [4]; there are no cycles or loops in the network. Recurrent neural networks (RNs), on the other hand, are models with bidirectional data flow: while a feed-forward network circulates data linearly from input to output, RNs also propagate data from later processing stages back to earlier stages [4]. Further, a neural network can be configured so that applying a set of inputs produces the desired set of outputs. A popular way is to 'train' the neural network by providing it with teaching patterns and allowing it to change its weights in accordance with some learning rule. Learning situations may be categorized as supervised learning, where the network is trained by providing it with input and matching output patterns, or as unsupervised learning, where an (output) unit is trained to recognize clusters of patterns within the input; in this paradigm the system is expected to find statistically salient features of the input population [4].
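As an illustrative sketch of supervised training of a feed-forward network, assuming scikit-learn's MLPClassifier and a synthetic data set (both our own choices): inputs flow through one hidden layer to the outputs, and the weights are adjusted against matching input/output patterns.

```python
# A sketch of a supervised feed-forward neural network (illustrative only).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Invented two-class training data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Input -> one hidden layer of 8 nodes -> output; weights are updated by a
# gradient-based learning rule from input/output training pairs.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```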
Regression
Linear regression (LR) is mainly used to model continuous-valued functions. It is widely used, owing largely to its simple structure. Generalized linear models provide the theoretical foundation on which LR can be extended to the modeling of categorical response variables [5].
Common types of generalized linear models include logistic
regression (LogR) and Poisson regression. Logistic
Regression models the probability of an event occurring as a
linear function of a set of predictor variables. Count data
commonly exhibit a Poisson distribution and are usually
modeled using Poisson regression [5].
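To ground this, here is a small sketch of logistic regression on a hypothetical purchase event with two invented predictor variables; scikit-learn's LogisticRegression is our own choice of implementation.

```python
# A sketch of logistic regression: the event probability is modelled as a
# logistic function of a linear combination of predictors (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # e.g. hypothetical income, products held
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("P(event) for a new customer:", model.predict_proba([[0.5, 0.2]])[0, 1])
```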
Business Decision and Application Analysis

Here we look at how one may decompose each application into four parts in order to form a standard description. Those four parts are as follows.
1. Business application activity (e.g. cross-selling).
2. Processing steps for solving that business problem. Each step obtains certain derived knowledge which matches certain patterns in the data by investigating problem characteristics (e.g. customer segmentation analysis for cross-selling).
3. Processing characteristics, i.e. information which needs to be assigned or predefined in the processing steps (e.g. customer background data for cross-selling).
4. Processing outcome required for the analytical result of each problem processing step (e.g. customer profile from segmentation analysis).
Business Application
Business application can be broken down into several
business problems. These business problems can be further
divided into problem processing steps and problem
characteristics which are derived from problem descriptions.
The two applications studied in this review are cross-selling and segmentation analysis. The analysis of these applications is shown in Table 1 [6].
Cross-Selling

Cross-selling applications primarily consist of financial product cross-selling and retail member customer cross-selling. There are threefold advantages to a cross-selling strategy. First, targeting customers with those products that
strategy. First, targeting customers with those products that
they are highly likely to buy should increase sales and in turn
increase profits. Second, reducing the volume of people
targeted via more selective targeting should reduce costs.
Finally, it is a well-known fact in the financial sector that
loyal customers normally have more than two products on
average; hence, persuading customers to buy more than one
product should increase customer loyalty [6]. In order to
achieve cross-selling effects, knowing which person would be
interested in what product is the key. The overall goal is to
discover characteristics of current customers that can then be
used to mark all other customer segments in order to classify
them into potential promotion targets and unlikely purchasers
[6].
Segmentation Analysis
Segmentation is essentially classifying customers into groups with similar characteristics, such as demographic, geographic, or behavioral traits, and marketing to them as a group. Facing a market with differing demands, applying a market segmentation strategy can boost expected returns [6]. A major chunk of
marketing research is concentrated on examining how
variables such as demographics and socioeconomic status can
be employed to predict differences in consumption and brand
loyalty. The segmentation problem should be treated as two different situations: 'known character parameters' and 'unknown character parameters' [6]. When character parameters are known, segmentation analysis deals with customers who have transactional or behavioral records stored in the enterprise database, and the analytic parameters are predefined, derived from the analyst's interests [6].
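For the 'unknown character parameters' case, a clustering method can discover the segments. The following sketch uses invented behavioral features (spend and visit frequency) and k-means, our own choice of algorithm, purely to illustrate the idea.

```python
# A sketch of customer segmentation by clustering (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two invented customer groups: low-spend/infrequent vs high-spend/frequent.
spend = np.concatenate([rng.normal(100, 10, 50), rng.normal(300, 30, 50)])
visits = np.concatenate([rng.normal(2, 0.5, 50), rng.normal(8, 1.0, 50)])
X = np.column_stack([spend, visits])

# k-means groups customers with similar behavioral traits into segments.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("customers per segment:", np.bincount(segments))
```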
Table 1: Business Decision and Application Analysis

Cross-Selling
  Step 1: Find relationships of characteristics
    Processing characteristics: Customer Background Data
    Processing outcome: Customer Profile
  Step 2: Match campaigns to potential customers
    Processing characteristics: Customer Background Data
    Processing outcome: Prospect List

Segmentation Analysis
  Step 1: Classify customers
    Processing characteristics: Customer Transaction Data (input data unit, discrete time sequence)
    Processing outcome: Prospect List
  Step 2: Match campaigns to potential customers
    Processing characteristics: Customer Background Data (classification algorithms)
    Processing outcome: Prospect List
Table 2: Comparative Study of Classification Algorithms

Basic function
  Naïve Bayes: a statistical classifier; predicts class membership probabilities.
  Decision Tree: a heuristic, one-step lookahead (hill-climbing), non-backtracking search.
  Neural Networks: hidden interconnections between input and output nodes form a large network.
  Support Vector Machine: an algorithm that uses nonlinear mapping to transform data into a higher dimension.
  Logistic Regression: models the probability of an event occurring as a linear function of a set of predictor variables.

Types of values
  Naïve Bayes: a continuous classifier.
  Decision Tree: predicts categorical and continuous values.
  Neural Networks: data-driven, self-adaptive; adjusts itself to the data without any explicit specification.
  Support Vector Machine: with a suitable choice of parameters, can separate any consistent data set.
  Logistic Regression: used to model continuous-valued functions.

Speed of training and convergence
  Naïve Bayes: fast; less training data needed.
  Decision Tree: fast, because a decision tree inherently "throws away" input features it does not find useful.
  Neural Networks: slow; uses all the input nodes if no selection is performed.
  Support Vector Machine: slow; training takes a lot of time.
  Logistic Regression: -

Ease of modelling
  Naïve Bayes: hard to debug or understand, and difficult to test.
  Decision Tree: easy to understand; suited to modelling and visual representation.
  Neural Networks: difficult to explain; complicated visual representation.
  Support Vector Machine: good accuracy and flexibility.
  Logistic Regression: new data can be incorporated into the model easily; best when classification thresholds need to change.

Issue of over-fitting
  Naïve Bayes: computationally expensive for datasets with high-dimensional attributes.
  Decision Tree: prone to over-fitting.
  Neural Networks: no general method exists to determine the optimal number of neurons needed.
  Support Vector Machine: incorporates capacity control to prevent over-fitting.
  Logistic Regression: -

Specific issue
  Naïve Bayes: features are assumed to be independent; normalization is needed.
  Decision Tree: outliers can be lost through pruning; must be tuned by adding weights.
  Neural Networks: training is time-consuming and requires several passes through the network.
  Support Vector Machine: only directly applicable to two-class tasks.
  Logistic Regression: can be insensitive to minute data.
Proposed Explication:

Since cross-selling places more emphasis on visual representation, so that other departments such as marketing can draw coherent conclusions from the results, decision tree classification algorithms should be used. However, C4.5 must also be used to prune the tree to avoid over-fitting, and weights must be added so that outliers are not lost. Segmentation analysis also matches campaigns to potential customers; however, the need for understandable modelling is less important there. Therefore, the most efficient algorithm would be neural networks, since all kinds of data can be used and the required associations can be formed across the network.
CONCLUSION

Business applications require careful analysis to efficiently decide the most suitable data mining algorithm to use according to their characteristics. Classification of customers and products is of prime importance. Therefore, the data mining algorithms must be carefully analyzed and used according to the specifics of the problem.
REFERENCES

[1] Guoqiang Peter Zhang. 2000. Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30, No. 4.
[2] Indranil Bose and Radha K. Mahapatra. 2001. Business data mining: a machine learning perspective. Information & Management. Elsevier.
[3] I. Rish. 2001. An empirical study of the naïve Bayes classifier. T. J. Watson Research Center.
[4] Yashpal Singh and Alok Singh Chauhan. 2009. Neural Networks in Data Mining. Journal of Theoretical and Applied Information Technology.
[5] Reza Entezari-Maleki, Arash Rezaei, and Behrouz Minaei-Bidgoli. 2009. Comparison of Classification Methods Based on the Type of Attributes and Sample Size. Iran University of Science & Technology, Tehran, Iran.
[6] Jia-Lang Seng and T. C. Chen. 2010. An analytic approach to select data mining for business decision. Expert Systems with Applications.
[7] Carlos J. Mantas and Joaquin Abellan. 2014. Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data. Expert Systems with Applications.