Classification

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most of these algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large amounts of disk-resident data. Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.

Suppose instead that a marketing manager wants to predict how much a given customer will spend during a sale at an e-commerce website. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. Such a model is a predictor. Regression analysis is a statistical methodology most often used for numeric prediction; hence the two terms tend to be used synonymously, although other methods for numeric prediction exist. Classification and numeric prediction are the two major types of prediction problems.

"How does classification work?" Data classification is a two-step process consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).

[Figure: learning step — training data is fed to a classification algorithm, which produces a classifier (model); classification step — the classifier is applied to test data.]

Because the class label of each training tuple is provided, the learning step is also known as supervised learning. It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.

"What about classification accuracy?" The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. The table below depicts the confusion matrix for a binary classification problem. Each entry fij denotes the number of instances from class i predicted to be of class j. For example, f01 is the number of instances from class 0 incorrectly predicted as class 1.

                   Predicted class 1    Predicted class 0
Actual class 1     f11                  f10
Actual class 0     f01                  f00

The number of correct predictions made by the model is (f11 + f00), and the number of incorrect predictions is (f10 + f01). Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed as follows:

Accuracy = (Number of correct predictions) / (Total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

The error rate is a related metric, defined as follows for binary classification problems:

Error rate = (Number of incorrect predictions) / (Total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
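To make these metrics concrete, here is a minimal sketch that computes accuracy and error rate from a binary confusion matrix. The evaluate() helper and the example counts are illustrative assumptions, not taken from the notes.

```python
# Minimal sketch: accuracy and error rate from a binary confusion matrix.
def evaluate(f):
    # f[i][j] = number of instances of actual class i predicted as class j
    correct = f[1][1] + f[0][0]        # f11 + f00
    incorrect = f[1][0] + f[0][1]      # f10 + f01
    total = correct + incorrect
    return correct / total, incorrect / total

# Example confusion matrix (made-up counts): rows = actual class, columns = predicted class
f = [[40, 10],    # actual class 0: f00 = 40, f01 = 10
     [5, 45]]     # actual class 1: f10 = 5,  f11 = 45
accuracy, error_rate = evaluate(f)
print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")  # 0.85 and 0.15
```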
Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

Basic Algorithm to Build a Decision Tree

One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART.

Hunt's Algorithm

In Hunt's algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single root node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion. A child leaf node is created for each outcome of the attribute test condition, and the instances associated with the parent node are distributed to the children based on the test outcomes. This node expansion step can then be applied recursively to each child node, as long as the child contains instances of more than one class. If all the instances associated with a leaf node have identical class labels, the node is not expanded any further. Each leaf node is assigned the class label that occurs most frequently among the training instances associated with it.

Impurity Measure for a Single Node

The following class distributions illustrate a pure node, an impure node, and nodes with a low and a high degree of impurity:

             Pure node   Impure node   Low degree of impurity   High degree of impurity
Class 0      0           3             1                        4
Class 1      8           5             8                        4

How to Split the Decision Tree (Splitting Criterion)?

Binary Attributes: The test condition for a binary attribute generates two potential outcomes.

Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split.

Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.

Continuous Attributes: For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form vi ≤ A < vi+1, for i = 1, . . . , k, producing a multiway split.
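To make the node-expansion step and the comparison-test split A < v concrete, here is a minimal sketch of Hunt's algorithm for binary splits on continuous attributes. The function names (gini, best_split, grow) and the choice of the Gini index as the impurity measure are assumptions for illustration; the notes do not fix a specific measure, and real implementations such as ID3, C4.5, and CART add many refinements (nominal attributes, pruning, and so on).

```python
from collections import Counter

def gini(labels):
    """Impurity of a single node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Pick the (attribute, threshold) test A < v with the lowest weighted impurity."""
    best = None
    for a in range(len(rows[0])):                      # candidate attribute
        for v in sorted({r[a] for r in rows}):         # candidate threshold
            left = [y for r, y in zip(rows, labels) if r[a] < v]
            right = [y for r, y in zip(rows, labels) if r[a] >= v]
            if not left or not right:                  # skip trivial splits
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[0]:
                best = (w, a, v)
    return best                                        # None if no valid split exists

def grow(rows, labels):
    """Hunt's algorithm: recursively expand nodes that contain more than one class."""
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or split is None:
        return Counter(labels).most_common(1)[0][0]    # leaf: most frequent class label
    _, a, v = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] < v]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] >= v]
    return {"test": (a, v),                            # internal node: test "attribute a < v"
            "left": grow(*zip(*left)),
            "right": grow(*zip(*right))}

# Tiny example: one continuous attribute, two classes
rows = [(1.0,), (2.0,), (3.0,), (8.0,), (9.0,)]
labels = [0, 0, 0, 1, 1]
print(grow(rows, labels))   # {'test': (0, 8.0), 'left': 0, 'right': 1}
```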
Bayes Classification Methods

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem, described next. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naïve."

Bayes' Theorem

Bayes' theorem is named after Thomas Bayes. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

"How are these probabilities estimated?" P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see next. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is

P(H|X) = P(X|H) P(H) / P(X)
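To make the formula concrete, the short sketch below plugs estimates into Bayes' theorem for the buys-computer example. The probability values are made-up assumptions chosen purely for illustration, not estimates from any data in the notes.

```python
# Illustrative use of Bayes' theorem for the example above (made-up numbers).
# H = "customer buys a computer"; X = "customer is 35 years old and earns $40,000".
p_h = 0.6            # prior P(H): fraction of all customers who buy a computer (assumed)
p_x_given_h = 0.20   # P(X|H): fraction of buyers who match X's age and income (assumed)
p_x = 0.15           # prior P(X): fraction of all customers who match X (assumed)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.2f}")   # 0.80: posterior probability that X buys a computer
```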