Ascertaining Apropos Mining Algorithms for Business Applications

Misha Sheth
Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
misha.sheth@gmail.com

Parth Mehta
Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
parthpmehta93@gmail.com

Prof. Chetashri Bhadane
Asst. Prof., Dept. of Computer Engineering,
D. J. Sanghvi COE, Mumbai University
chetashri.bhadane@djsce.ac.in
ABSTRACT
With the onset of the information technology era, there is an
increasing trend of enterprises attempting to collect and store
colossal amounts of data. This calls for efficient data mining
technologies to expedite data processing, information retrieval
and subsequent knowledge generation. Since the complexities of data mining are difficult to grasp, determining the optimal method for a given application becomes of prime importance. To resolve this problem, we
analyze several approaches vis-à-vis their methodology, type
of input parameters, speed of training, ease of modelling as
well as issues specific to each method. This allows for swift
and profitable applications of data mining mechanisms.
Further, leveraging the specific strengths and weaknesses of
these techniques in context of business, we look at two
applications of data mining in the financial area and attempt to
suggest an appropriate method for each of them.
Keywords
Data Mining, Knowledge Discovery, Classification Methods,
Business Decision
INTRODUCTION
With the advent of the Internet and the rising growth of business, there is a gradual realization that the huge amounts of data collected can be processed and analyzed to support strategic decision making. This data ought to be converted
into information via a sequential data gathering and mining
process. Once data is collected using various collection
methods, it is cleaned and processed to remove discrepancies.
Numerous data mining approaches can then be applied on this
data to yield desired outcomes. This ultimately culminates
into intelligent decision making. By knowledge discovery in
databases, useful knowledge, discrepancies, and important
information can be drawn out from the database for
investigation from various perspectives.
However, due to the diverse domains in which business applications are present, it is difficult to precisely identify the algorithm best suited to a specific category, requirement, and desired outcome. We adopt a literature survey to summarize the approaches and concepts involved.
Moreover, the selection of a technique requires both conceptual analysis and an operational definition of the business decision and its applications. Applications are usually composed of several problems to be solved, and in this review we study each application by breaking it into parts to establish a standard description.
The rest of the paper is organized as follows: we first present a review of previous work in this area, including an explanation of the various data mining techniques and of the two applications being considered, namely cross-selling and segmentation analysis. We then provide a comparative study of five data mining techniques: Naïve Bayes, Decision Tree, Neural Network, Support Vector Machine, and Logistic Regression. Finally, we evaluate the results and conclude the paper.
LITERATURE REVIEW
People often take data mining as a synonym for a popularly
used term, Knowledge Discovery in Databases, or KDD. It is
also correct to view data mining as simply an important and
crucial step in the process of knowledge discovery in
databases. Selecting a data mining algorithm includes choosing the method(s) to be used for finding patterns in the data, deciding which parameters and models may be appropriate, and matching a particular data mining method with the overall requirements of the KDD process. In the final step, the mining results that match the requirements are elucidated and organized so that they can be put into action or presented to interested parties. The concept of data mining encompasses all the activities and methods that use collected data to derive implicit information and evaluate historical records to gain valuable knowledge.
Naïve Bayes
Bayesian classifiers are statistical classifiers. Class
membership probabilities can be predicted using Bayesian
classifiers. They designate the most likely class to a specific
given example, as illustrated by its feature vector [3]. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where $X = (X_1, \ldots, X_n)$ is a feature vector and $C$ is a class [3]. Naïve Bayes (NB) probabilistic classifiers are the most commonly used [3]. The primary idea in this approach is to use the joint probabilities of words and categories to estimate the probabilities of categories for a specific scenario or document.
The "naïve" part of the Naïve Bayes technique is the assumption of word independence: the conditional probability of a word given a category is assumed to be independent of the conditional probabilities of other words given that category [5]. This assumption is a primary reason why Naïve Bayes classifiers can be computed far more efficiently than non-naïve Bayes techniques, whose complexity is exponential, since word combinations are not used as predictors [5].
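To make this concrete, the following is a minimal sketch of Naïve Bayes text categorization in Python; the toy documents, labels, and the choice of scikit-learn's MultinomialNB are illustrative assumptions of ours, not something the paper prescribes.

```python
# A minimal sketch of Naive Bayes text categorization (illustrative only):
# scikit-learn's MultinomialNB stands in for the classifier described above,
# and the documents/labels are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loan offer", "loan approval needed",
        "team meeting today", "project meeting notes"]
labels = ["finance", "finance", "work", "work"]

# Word counts per document; joint word/category counts drive P(word | category).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Category probabilities are computed under the word-independence assumption.
test = vectorizer.transform(["loan meeting"])
print(model.predict(test), model.predict_proba(test))
```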
Decision Tree
A decision tree (DT) is an extremely useful tool for classification. It is simple, easy to understand, and easy to analyze. Furthermore, building the classification model does not require much time. A DT has a flowchart-like
tree structure, where every internal node denotes a test on an
attribute, each branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class label [5]. The
node at the top of a tree is the root node. While constructing
the tree, attribute selection measures are employed to find the attribute that best partitions the tuples into distinct
classes. Information Gain, Gain Ratio, and Gini Index are
popular attribute selection measures. While constructing a DT
for the purpose of classification a crucial factor that needs to
be addressed is the degree of adjustment of the model to the
training set being used. If a tight stopping criterion is employed during construction, the result is a small, under-fitted DT. On the other hand, a loose stopping criterion generates a large DT that over-fits the data of the training set. Pruning methods were developed to solve this dilemma: using a loose stopping criterion, the DT is allowed to over-fit the training set, and the over-fitted tree is then cut back to a smaller tree by eliminating sub-branches that do not contribute to generalization accuracy. Such pruning leads to improved performance [7].
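As a sketch of this grow-then-prune strategy, the fragment below (assuming scikit-learn and its bundled iris data, both our own choices) grows a tree on the Gini index and then cuts it back with cost-complexity pruning, which here plays the role of the pruning methods described above.

```python
# A sketch of decision-tree induction with post-pruning (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Loose stopping criterion: grow the full tree, splitting on the Gini index.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)

# Cost-complexity pruning cuts back sub-branches that add little
# generalization accuracy (larger ccp_alpha prunes more aggressively).
pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02,
                                     random_state=0)
pruned_tree.fit(X_train, y_train)

print("full tree leaves:", full_tree.get_n_leaves(),
      "test accuracy:", full_tree.score(X_test, y_test))
print("pruned tree leaves:", pruned_tree.get_n_leaves(),
      "test accuracy:", pruned_tree.score(X_test, y_test))
```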
Support Vector Machine

A support vector machine (SVM) is an algorithm which uses a nonlinear mapping to transform the original training data into a higher dimension. In this new dimension, it searches for the linear optimal separating hyperplane, a "decision boundary" that separates the tuples of one class from another. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors) [5]. The essential idea behind the support vector machine is illustrated by the example shown in Figure 1, where the data are assumed to be linearly separable; thus, there exists a linear hyperplane which separates the points into two different classes. In a two-dimensional model, the hyperplane is a simple straight line. Figure 1 illustrates two such hyperplanes, B1 and B2; both can divide the training examples into their respective classes without committing any misclassification errors [5].

Figure 1: An example of a two-class problem with two separating hyperplanes.

Even though the training time of even the fastest SVMs may be extremely slow, they are highly accurate, particularly owing to their ability to model complex nonlinear decision boundaries. They are also much less prone to over-fitting than other methods [5].
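A minimal sketch of this idea, assuming scikit-learn's SVC and a synthetic two-class data set of our own making: the RBF kernel supplies the nonlinear mapping to a higher dimension, and the fitted model exposes the support vectors that define the margin.

```python
# A sketch of SVM classification on data that is not linearly separable
# in the original two dimensions (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular class boundary

# The RBF kernel implicitly maps the data to a higher dimension where a
# separating hyperplane exists.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors used:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```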
Neural Networks

A neural network is a mathematical or computational model based on biological neural networks; one can think of it as an emulation of biological neural mechanisms. A typical network consists of a set of input nodes that are connected to a set of output nodes through a set of hidden nodes [2]. It consists of a system of interconnected artificial neurons and evaluates information by employing a connectionist approach to computation. Often, it is an adaptive system that transforms its structure based on external or internal knowledge that progresses through the network during the learning phase [1]. In a feed-forward neural network, the information flows in only the forward direction, from the input nodes, through the hidden nodes (if any), to the output nodes [4]; there are no cycles or loops in the network. Recurrent neural networks (RNs), on the other hand, are models with bidirectional data flow: while a feed-forward network circulates data linearly from input to output, RNs also propagate data from later processing stages back to earlier stages [4]. Further, a neural network can be configured so that applying a set of inputs produces the desired set of outputs. A popular way is to 'train' the neural network by providing it with teaching patterns and allowing it to change its weights in accordance with some learning rule. Learning situations may be categorized as supervised learning, where the network is trained by providing it with input and matching output patterns, or as unsupervised learning, where an (output) unit is trained to recognize clusters of patterns within the input; in this paradigm the system is expected to find statistically salient features of the input population [4].
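As an illustrative sketch of supervised training of a feed-forward network, assuming scikit-learn's MLPClassifier and a synthetic data set (both our own choices): inputs flow through one hidden layer to the outputs, and the weights are adjusted against matching input/output patterns.

```python
# A sketch of a supervised feed-forward neural network (illustrative only).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Invented two-class training data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Input -> one hidden layer of 8 nodes -> output; weights are updated by a
# gradient-based learning rule from input/output training pairs.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```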
Regression
Linear regression (LR) is mainly used to model continuous-valued functions. It is widely used, owing largely to its simple structure. Generalized linear models provide the theoretical foundation on which LR can be extended to the modeling of categorical response variables [5].
Common types of generalized linear models include logistic
regression (LogR) and Poisson regression. Logistic
Regression models the probability of an event occurring as a
linear function of a set of predictor variables. Count data
commonly exhibit a Poisson distribution and are usually
modeled using Poisson regression [5].
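To ground this, here is a small sketch of logistic regression on a hypothetical purchase event with two invented predictor variables; scikit-learn's LogisticRegression is our own choice of implementation.

```python
# A sketch of logistic regression: the event probability is modelled as a
# logistic function of a linear combination of predictors (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # e.g. hypothetical income, products held
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("P(event) for a new customer:", model.predict_proba([[0.5, 0.2]])[0, 1])
```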
Business Decision and Application Analysis

Here we look at how one may decompose each application into four parts in order to form a standard description. Those four parts are as follows.
1. Business application activity (e.g. cross-selling).
2. Processing steps for solving that business problem. Each step obtains certain derived knowledge which matches certain patterns in the data by investigating problem characteristics (e.g. customer segmentation analysis for cross-selling).
3. Processing characteristics, i.e. information which needs to be assigned or predefined in the processing steps (e.g. customer background data for cross-selling).
4. Processing outcome required for the analytical result of each problem processing step (e.g. customer profile from segmentation analysis).
Business Application
Business application can be broken down into several
business problems. These business problems can be further
divided into problem processing steps and problem
characteristics which are derived from problem descriptions.
The two applications studied in this review are cross-selling and segmentation analysis. The analysis of these applications is shown in Table 1 [6].
Cross-Selling

Cross-selling applications primarily consist of financial product cross-selling and retail member customer cross-selling. There are threefold advantages to a cross-selling strategy. First, targeting customers with those products that
strategy. First, targeting customers with those products that
they are highly likely to buy should increase sales and in turn
increase profits. Second, reducing the volume of people
targeted via more selective targeting should reduce costs.
Finally, it is a well-known fact in the financial sector that
loyal customers normally have more than two products on
average; hence, persuading customers to buy more than one
product should increase customer loyalty [6]. In order to
achieve cross-selling effects, knowing which person would be
interested in what product is the key. The overall goal is to
discover characteristics of current customers that can then be
used to mark all other customer segments in order to classify
them into potential promotion targets and unlikely purchasers
[6].
Segmentation Analysis
Segmentation is essentially classifying customers into groups with similar characteristics, such as demographic, geographic, or behavioral traits, and marketing to them as a group. Facing a market with differing demands, applying a market segmentation strategy can boost expected returns [6]. A major chunk of
marketing research is concentrated on examining how
variables such as demographics and socioeconomic status can
be employed to predict differences in consumption and brand
loyalty. The segmentation problem should be treated as two different situations: 'known character parameters' and 'unknown character parameters' [6]. When character parameters are known, segmentation analysis deals with customers who have transactional or behavioral records stored in the enterprise database, and the analytic parameters are predefined, derived from the analyst's interests [6].
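For the 'unknown character parameters' case, a clustering method can discover the segments. The following sketch uses invented behavioral features (spend and visit frequency) and k-means, our own choice of algorithm, purely to illustrate the idea.

```python
# A sketch of customer segmentation by clustering (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two invented customer groups: low-spend/infrequent vs high-spend/frequent.
spend = np.concatenate([rng.normal(100, 10, 50), rng.normal(300, 30, 50)])
visits = np.concatenate([rng.normal(2, 0.5, 50), rng.normal(8, 1.0, 50)])
X = np.column_stack([spend, visits])

# k-means groups customers with similar behavioral traits into segments.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("customers per segment:", np.bincount(segments))
```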
Table 1: Business Decision and Application Analysis

Cross-Selling
  Step 1: Find relationships of characteristics
    Processing characteristics: Customer Background Data
    Processing outcome: Customer Profile
  Step 2: Match campaigns to potential customers
    Processing characteristics: Customer Background Data
    Processing outcome: Prospect List

Segmentation Analysis
  Step 1: Classify customers
    Processing characteristics: Customer Transaction Data (input data unit, discrete time sequence)
    Processing outcome: Prospect List
  Step 2: Match campaigns to potential customers
    Processing characteristics: Customer Background Data (classification algorithms)
    Processing outcome: Prospect List
Table 2: Comparative Study of Classification Algorithms

Basic function
  Naïve Bayes: a statistical classifier; predicts class membership probabilities.
  Decision Tree: a heuristic, one-step lookahead (hill-climbing), non-backtracking search.
  Neural Networks: hidden interconnections between input and output nodes form a large network.
  Support Vector Machine: an algorithm that uses nonlinear mapping to transform data into a higher dimension.
  Logistic Regression: models the probability of an event occurring as a linear function of a set of predictor variables.

Types of values
  Naïve Bayes: a continuous classifier.
  Decision Tree: predicts categorical and continuous values.
  Neural Networks: data-driven, self-adaptive; adjusts itself to the data without any explicit specification.
  Support Vector Machine: with a suitable choice of parameters, can separate any consistent data set.
  Logistic Regression: used to model continuous-valued functions.

Speed of training and convergence
  Naïve Bayes: fast; less training data needed.
  Decision Tree: fast, because a decision tree inherently "throws away" input features it does not find useful.
  Neural Networks: slow; uses all the input nodes if no selection is performed.
  Support Vector Machine: slow; training takes a lot of time.
  Logistic Regression: -

Ease of modelling
  Naïve Bayes: hard to debug or understand, and difficult to test.
  Decision Tree: easy to understand; suited to modelling and visual representation.
  Neural Networks: difficult to explain; complicated visual representation.
  Support Vector Machine: good accuracy and flexibility.
  Logistic Regression: new data can be incorporated into the model easily; best when classification thresholds need to change.

Issue of over-fitting
  Naïve Bayes: computationally expensive for datasets with high-dimensional attributes.
  Decision Tree: prone to over-fitting.
  Neural Networks: no general method exists to determine the optimal number of neurons needed.
  Support Vector Machine: incorporates capacity control to prevent over-fitting.
  Logistic Regression: -

Specific issue
  Naïve Bayes: features are assumed to be independent; normalization is needed.
  Decision Tree: outliers can be lost through pruning; must be tuned by adding weights.
  Neural Networks: training is time-consuming and requires several passes through the network.
  Support Vector Machine: only directly applicable to two-class tasks.
  Logistic Regression: can be insensitive to minute data.
Proposed Explication:

Since cross-selling places more emphasis on visual representation, so that other departments such as marketing can draw coherent conclusions from the results, decision tree classification algorithms should be used. However, C4.5 must also be used to prune the tree to avoid over-fitting, and weights must be added so that outliers are not lost. Segmentation analysis also matches campaigns to potential customers; however, the need for understandable modelling is less important there. Therefore, the most efficient algorithm would be neural networks, since all kinds of data can be used and the required associations can be formed across the network.
CONCLUSION

Business applications require careful analysis to efficiently decide the most suitable data mining algorithm to use according to their characteristics. Classification of customers and products is of prime importance. Therefore, the data mining algorithms must be carefully analyzed and used according to the specifics of the problem.
REFERENCES

[1] Guoqiang Peter Zhang. 2000. Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30, No. 4.
[2] Indranil Bose and Radha K. Mahapatra. 2001. Business data mining: a machine learning perspective. Information & Management. Elsevier.
[3] I. Rish. 2001. An empirical study of the naïve Bayes classifier. T. J. Watson Research Center.
[4] Yashpal Singh and Alok Singh Chauhan. 2009. Neural Networks in Data Mining. Journal of Theoretical and Applied Information Technology.
[5] Reza Entezari-Maleki, Arash Rezaei, and Behrouz Minaei-Bidgoli. 2009. Comparison of Classification Methods Based on the Type of Attributes and Sample Size. Iran University of Science & Technology, Tehran, Iran.
[6] Jia-Lang Seng and T. C. Chen. 2010. An analytic approach to select data mining for business decision. Expert Systems with Applications.
[7] Carlos J. Mantas and Joaquin Abellan. 2014. Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data. Expert Systems with Applications.