Analysis of the Literature

Introduction and Motivation

One of the central tasks in the field of Machine Learning is the classification of data into disjoint groups. A machine can learn a new classifier from a set of training examples with existing class labels, or from a set of examples with no class labels; these techniques are known as supervised and unsupervised learning respectively. In supervised learning, the machine uses the class labels assigned in the training data to classify further data instances. There are a number of common ways to do this, such as the Naïve Bayes classifier, rule induction and decision tree induction. However, these and many other similar approaches seek to minimise the number of classification errors: they ignore the possible variation in the cost of each misclassification, instead assuming that every classification error has equal cost. This is known as cost-insensitive learning.

This method of classification is not practical in a great many, if not most, real-world applications. Very often there is one set of classifications it is critical to get right, and another where a mistake is far less costly. For example, when screening patients for cancer, it is far more harmful to produce a false negative (identifying a patient as healthy when in fact they are at risk of death) than a false positive (treating a patient who is actually healthy). Machine learning approaches which take this into account use cost-sensitive learning.

The class imbalance problem arises in many real-world applications and occurs when one of the classes vastly outnumbers the others. The cancer example above is also an instance of the class imbalance problem: there are far more people in the world without cancer than with it.

Theory of Cost-Sensitive Learning

Much of the theory of cost-sensitive learning centres on the respective costs of false positive and false negative classifications, and the 'reward' of a true positive and a true negative. It is common practice (and simplifies matters) to normalise the costs so that a true positive and a true negative have cost zero; Tables 1 and 2 illustrate this in the binary case.

                      Negative prediction    Positive prediction
    Actual negative   C(0,0)                 C(1,0)
    Actual positive   C(0,1)                 C(1,1)

Table 1. The cost matrix: C(i,j) is the cost of predicting class i when the true class is j.

                      Negative prediction    Positive prediction
    Actual negative   0                      C(1,0) − C(0,0)
    Actual positive   C(0,1) − C(1,1)        0

Table 2. 'Normalised' matrix with values of 0 for true positives and true negatives. (This is not the same as regular mathematical normalisation.)

These cost values are supplied beforehand by various means and are not calculated during the training process, and it is common practice (without any loss of generality) to label the minority class as positive and the majority as negative. Given a matrix like the ones above, it is possible to calculate the expected cost R(i|x) of classifying a data instance x into class i (where, following the tables above, i = 1 is the positive class and i = 0 the negative):

    R(i|x) = Σ_j P(j|x) C(i,j)        (1)

where P(j|x) is the estimated probability that instance x belongs to class j. From this point forward we discuss purely binary cases, where instances are classified as either positive or negative.
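To make the use of (1) concrete, the following minimal sketch (our own illustration, not taken from any of the algorithms discussed) chooses the class with the smallest expected cost, given posterior estimates P(j|x) and a cost matrix laid out as in Table 1. The function names and cost values are purely illustrative.

```python
# Minimal sketch of expected-cost classification using equation (1).
# Convention as in Table 1: cost[i][j] = C(i, j), the cost of predicting
# class i when the true class is j (here 1 = positive, 0 = negative).

def expected_cost(i, posterior, cost):
    """R(i|x) = sum over j of P(j|x) * C(i, j)."""
    return sum(posterior[j] * cost[i][j] for j in range(len(posterior)))

def predict_min_cost(posterior, cost):
    """Return the class i with the smallest expected cost R(i|x)."""
    return min(range(len(posterior)),
               key=lambda i: expected_cost(i, posterior, cost))

# Illustrative costs: a false negative (C(0,1) = 50) is ten times worse than
# a false positive (C(1,0) = 5); correct predictions cost nothing.
cost = [[0, 50],   # predict 0: C(0,0), C(0,1)
        [5,  0]]   # predict 1: C(1,0), C(1,1)

posterior = [0.9, 0.1]                     # [P(0|x), P(1|x)]
print(predict_min_cost(posterior, cost))   # 1: R(1|x) = 4.5 < R(0|x) = 5.0
```

Even though this instance is far more likely to be negative, the high cost of a false negative tips the decision towards the positive class.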
From the expected-cost expression in (1), an instance x will be classified as positive if and only if the expected cost of predicting positive is no greater than that of predicting negative, i.e.

    P(0|x) C(1,0) ≤ P(1|x) C(0,1)

Because the problem is binary, P(0|x) = 1 − P(1|x), and hence it is possible to define a threshold parameter p* such that an instance x is classified as positive if P(1|x) ≥ p*. Then

    p* = C(1,0) / (C(1,0) + C(0,1)) = FP / (FP + FN)        (2)

From this equation, if a cost-insensitive classifier can produce a posterior probability estimate of P(1|x), it can be made into a cost-sensitive classifier by calculating the threshold p* as above and classifying a data instance as positive whenever P(1|x) ≥ p*. Not all cost-insensitive classifiers are capable of producing a posterior probability, though (decision tree algorithms such as C4.5, for example), since they are designed to predict the class directly rather than probabilistically. Such classifiers effectively predict the class using a threshold of 0.5, the value expected when misclassification costs are equal and class imbalance is not an issue. By means of sampling, however, it is possible to alter the class proportions so that this default behaviour corresponds to the required threshold p*. Assuming, as before, that positive examples are in the minority, leave them as they are and keep only a proportion C(1,0)/C(0,1) of the negative examples, discarding the rest. Since the cost of a false positive is (almost always) smaller than the cost of a false negative, this proportion is less than 1. This method of sampling is known as under-sampling, and it is effectively equivalent to proportional sampling, in which positive and negative examples are chosen in the ratio

    P(1|x) C(0,1) : P(0|x) C(1,0)        (3)

Methods of Cost-Sensitive Learning

There are two main categories of cost-sensitive learning: direct cost-sensitive learning and cost-sensitive meta-learning. In direct cost-sensitive learning the classifiers are cost-sensitive in themselves, whereas in cost-sensitive meta-learning a process is devised that takes an existing cost-insensitive classifier and makes it cost-sensitive, usually through sampling or thresholding. Both are explained in more detail below.

Direct Cost-Sensitive Learning

In direct cost-sensitive learning, the classification algorithm calculates a threshold value p* as part of its normal operation, thus addressing both the class imbalance problem and the problem of differing misclassification costs in its standard process. Cost-sensitive algorithms introduce the misclassification costs into the learning process itself and use them to guide classification. Two well-known algorithms that implement direct cost-sensitive learning are ICET and CSTree. The former uses a genetic algorithm to evolve a population of candidate settings for a decision tree learner, with the average classification cost as the fitness function of the genetic algorithm. The latter uses misclassification cost as part of the process of building the tree: instead of minimising attribute entropy, CSTree selects the attribute that minimises misclassification cost.

Cost-Sensitive Meta-Learning

Cost-sensitive meta-learning uses existing cost-insensitive algorithms without altering them, instead pre- and post-processing their inputs and outputs so that the system as a whole is cost-sensitive.
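As a sketch of the two conversions described above, thresholding from (2) and cost-proportional under-sampling from (3), the following assumes a base classifier that can output an estimate of P(1|x); the helper names and cost values are our own illustration, not part of any particular library.

```python
import random

def threshold(cost_fp, cost_fn):
    """p* = C(1,0) / (C(1,0) + C(0,1)) = FP / (FP + FN), as in (2)."""
    return cost_fp / (cost_fp + cost_fn)

def thresholded_label(p_positive, cost_fp, cost_fn):
    """Label an instance positive whenever P(1|x) >= p*."""
    return 1 if p_positive >= threshold(cost_fp, cost_fn) else 0

def cost_proportional_undersample(positives, negatives, cost_fp, cost_fn, seed=0):
    """Keep every (minority) positive example and roughly a proportion
    C(1,0)/C(0,1) of the negatives, so that a classifier trained with the
    default 0.5 threshold behaves as if the threshold were p*."""
    rng = random.Random(seed)
    keep = cost_fp / cost_fn          # < 1 when a false negative costs more
    return positives, [x for x in negatives if rng.random() < keep]

# With FP cost 5 and FN cost 50, p* = 5 / 55 ≈ 0.09: the classifier need only
# be about 9% confident that x is positive before labelling it positive.
print(threshold(5, 50))
print(thresholded_label(0.2, 5, 50))   # 1, since 0.2 >= 0.09
```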
Cost-sensitive meta-learning algorithms are generally split into two categories, thresholding and sampling, making use of methods (2) and (3) respectively. Provided the cost-insensitive classifier can generate a posterior probability, thresholding uses the threshold parameter p* to classify data instances as positive or negative. Thresholding methods rely on the accuracy of the calculated posterior probability estimates, so it is important that these estimates be as reliable as possible. MetaCost uses bagging (an ensemble of base classifiers, decision trees in the original work) to estimate the posterior probabilities of the training examples, re-labels each training example with the class that minimises its expected cost, and then trains a cost-insensitive classifier on the re-labelled data. One of the more intuitive approaches to thresholding is cost-sensitive Naïve Bayes: since the Naïve Bayes algorithm already calculates posterior probabilities, it is a simple matter to apply the threshold parameter when classifying data instances.

Sampling pre-processes the training examples in accordance with (3), under-sampling the data set and then applying cost-insensitive classifiers to the sampled data. Estimates of probabilities are unnecessary, provided positive and negative examples can be classified with a reasonable degree of accuracy.

Conclusion

Three differing methods of cost-sensitive learning have been covered: direct cost-sensitive learning, thresholding and sampling. All three aim to combat the problems that arise in standard cost-insensitive algorithms when misclassification costs are not uniform and class membership is imbalanced.

When faced with a vast majority of negative examples and a minority of positive ones, cost-insensitive methods tend to predict almost every example they are presented with as negative. Whilst this may seem an immediate problem, it has actually been shown that, provided misclassification costs are uniform and the primary objective for the classifier is optimum prediction accuracy, this is the best thing to do. The cost-sensitive methods discussed above therefore only become useful when misclassification costs differ, and offer no advantage on data sets that are merely imbalanced.

Investigation suggests that, for the most part, all three methods perform relatively evenly across a wide variety of sample data sets. However, for very small data sets the over-sampling approach performs best, and for large data sets (in the region of 10,000 records or more) the direct cost-sensitive algorithms emerge as clear winners. Between these two size constraints, the optimal approach varies drastically from data set to data set. It is interesting to note that 'wrapper' approaches, which apply sampling as a pre- or post-process around an existing algorithm, work just as well as, or even better than, the more thorough and technically principled direct cost-sensitive algorithms, despite the inherent disadvantages of both over- and under-sampling. The compatibility of sampling methods with a great many existing cost-insensitive algorithms makes them much easier to use in a variety of situations, and raises the question of whether direct cost-sensitive algorithms are really necessary if they cannot be made to perform better on small and medium-sized data sets.

Data Pre-Processing

Data Discretizing

Discretizing data is necessary for some types of machine learning algorithms, but it can have benefits even for algorithms where it is not necessary.
It works by taking a continuous attribute, such as 'MonthlyIncome', and organising its values into categories chosen by the discretization algorithm. There are two types of discretization algorithm, unsupervised and supervised, and each functions in a way analogous to unsupervised and supervised machine learning. The former creates categories without using the class labels of the training examples, either separating the values into a fixed number of intervals based on the range of the data, or into intervals that each contain the same number of values. Supervised discretization, on the other hand, uses the class labels to split the data into categories in a more informative and useful way. Supervised discretization is almost always superior, and so it is the method we used in our pre-processing.

We used supervised discretization to pre-process the data for four of our five algorithms; the better categorisation of the data improved the accuracy of the models. It was not necessary for the decision tree model that we used (J48), since that algorithm handles continuous attributes itself by choosing split points as it builds the tree.

Sampling

Methods of sampling were explored thoroughly in the analysis section, but Weka offers little built-in support for cost-proportional sampling; we used its 'SpreadSubsample' filter. We applied it to all of the algorithms, and it achieved a very slight improvement in accuracy for Naïve Bayes when applied before discretization. However, the size of our data set made it unreliable, and it often caused a drastic decrease in accuracy, so it was not included as part of our model. For smaller data sets, under-sampling would be much more useful and would almost always be used.

Evaluation of Models

As advised in the brief, we measured the accuracy of our models by the area under their ROC curve (the ROC AUC). This area gives the probability that the classifier will assign a higher score to a randomly chosen positive instance than to a randomly chosen negative one (a short sketch of this pairwise interpretation is given at the end of this section). In an ideal world this probability would be equal to 1; in practice it suffices that higher values are better than lower ones. The best values that we achieved for the ROC AUC are as follows:

• Naïve Bayes: 0.864
• Logistic Regression: 0.865
• Decision Tree (J48): 0.842
• K-Nearest Neighbour (IBk): 0.842
• Rule Induction (Prism): 0.835

As can be seen, Naïve Bayes and Logistic Regression performed best, with Logistic Regression slightly superior. J48, the decision tree algorithm, and IBk, the K-Nearest Neighbour method, were also reasonably effective. Prism, the rule induction algorithm, was the least effective, but not by a large margin. From this data alone it would appear that Naïve Bayes is a slightly inferior classifier to Logistic Regression, but the margin is so small that it is hard to tell without further evidence.
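The pairwise interpretation of the ROC AUC mentioned above can be computed directly, as in the sketch below. This is our own illustration with made-up scores; the evaluation figures reported above come from Weka.

```python
def auc_from_scores(pos_scores, neg_scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores (estimated probability of the positive class):
print(auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9 ≈ 0.889
```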
To separate the two leading models, we therefore used additional statistics generated by running them on the data set given (in this data set, negative was the minority class, and thus a false negative is more expensive than a false positive):

Naïve Bayes:
• Correctly classified instances: 3194 (87.0547%)
• Number of false positives: 285
• Number of false negatives: 190
• Total number of instances: 3669
• Kappa statistic: 0.4953
• Root mean squared error: 0.3185

Logistic Regression:
• Correctly classified instances: 8408 (93.4222%)
• Number of false positives: 503
• Number of false negatives: 89
• Total number of instances: 9000
• Kappa statistic: 0.2527
• Root mean squared error: 0.2289

These additional data further support the view that Logistic Regression is superior: it has a higher percentage of correct classifications, a lower number of false negatives (the expensive misclassification) and a lower root mean squared error. The only anomaly is the Kappa statistic, which would imply that Naïve Bayes is superior. Our Kappa values are much smaller than the corresponding ROC AUC values, which is a typical result, since the Kappa statistic is usually a more conservative measure of accuracy. However, Kappa values normally correlate very well with ROC AUC, better than with any other accuracy measure. Since our Kappa values diverge unexpectedly, we would hesitate to rely on Kappa as a measure of accuracy in this case.

Setting the Kappa statistic aside, then, the evidence points towards Logistic Regression being the optimal model for the data set given: it has the largest ROC AUC, the smaller number of false negatives and the highest correct classification percentage. It is, of course, impossible to determine which of the five models is optimal in general, since their performance could change dramatically given different training examples or a different test data set. For the training examples available, however, the Logistic Regression model is optimal.

References

[1] Chawla, Japkowicz, Kolcz: Editorial: Special Issue on Learning from Imbalanced Data Sets
[2] Ling, Sheng: Cost-Sensitive Learning and the Class Imbalance Problem
[3] Weiss, McCarthy, Zabar: Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
[4] Turney: Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm
[5] Davis, Ha, Rossbach, Ramadan, Witchel: Cost-Sensitive Decision Tree Learning for Forensic Classification
[6] Domingos: MetaCost: A General Method for Making Classifiers Cost-Sensitive
[7] Lustgarten, Gopalakrishnan, Grover, Visweswaran: Improving Classification Performance with Discretization on Biomedical Datasets