Machine Learning Analysis

Introduction and Motivation
One of the most important tasks in the field of machine learning is the classification of data into disjoint
groups. A machine can learn a new classifier from a set of training examples with existing class
labels, and then use the knowledge it has gained to classify further data instances. There are a
number of common ways to do this, such as the Naïve Bayes classifier, rule induction and decision
tree induction. However, these and many other similar approaches seek to minimise the number
of classification errors – they ignore the possible variation in the cost of each misclassification, instead
assuming that every classification error has equal cost. This is known as cost-insensitive learning.
This method of classification is impractical in a great many, if not most, real-world
applications. Very often there will be one set of classifications it is critical to be sure
about, and another where a mistake is not as costly. For example, when screening patients for cancer, it
is far more harmful to get a false negative – that is, to identify a patient as healthy when in fact they are
at risk of death – than it is to get a false positive and treat a patient who is actually healthy. Machine
learning approaches which take this into account use cost-sensitive learning.
The class imbalance problem arises in many real-world applications, and occurs when one of
the classes vastly outnumbers the others. The cancer example above is also an instance of the class
imbalance problem: there are far more people in the world without cancer than there are with it.
Theory of Cost-Sensitive Learning
Much of the theory of cost-sensitive learning centres on the respective costs of false positive and
false negative classifications, and the ‘reward’ of a true positive and a true negative. It is common
practice (and simplifies matters) to normalise the costs so that a true positive and true negative
simply have cost zero - tables 1 and 2 illustrate this in a binary case.
                     Negative Prediction    Positive Prediction
Actual Negative      C(0,0)                 C(1,0)
Actual Positive      C(0,1)                 C(1,1)

Table 1. Matrix showing the respective costs, C(i,j), of each pair of prediction and actuality.
                     Negative Prediction    Positive Prediction
Actual Negative      0                      C(1,0) – C(0,0)
Actual Positive      C(0,1) – C(1,1)        0

Table 2. 'Normalised' matrix with values of 0 for true positives and true negatives. (This is not the same as regular
mathematical normalisation.)
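The step from Table 1 to Table 2 can be sketched in a few lines of Python: within each actual class (each column of the matrix), the cost of the correct prediction is subtracted from both entries. The cost values below are purely illustrative.

```python
def normalise_costs(C):
    """'Normalise' a 2x2 cost matrix C[i][j] (the cost of predicting class i
    when the true class is j) so that correct predictions cost zero: within
    each actual class j, subtract the correct-prediction cost C(j, j)."""
    return [[C[i][j] - C[j][j] for j in range(2)] for i in range(2)]

# Illustrative values: TN = 1, FN = 50, FP = 10, TP = 2
C = [[1, 50],    # C(0,0), C(0,1)
     [10, 2]]    # C(1,0), C(1,1)
print(normalise_costs(C))   # → [[0, 48], [9, 0]]
```

The normalised matrix preserves the cost differences that drive every decision below, which is why the simplification loses no generality.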
These cost values are supplied beforehand by various means, and are not calculated during the
training process. It is common practice (without any loss of generality) to label the minority class
as positive and the majority as negative. Then, given a matrix analogous to the ones above, it is
possible to calculate the expected cost, R(i|x), of classifying a data instance x into the class i (where,
in the tables above, i = 1 is the positive class and i = 0 the negative):
𝑅(𝑖|π‘₯) = ∑ 𝑃(𝑗|π‘₯)𝐢(𝑖, 𝑗)
(1)
𝑗
Where 𝑃(𝑗|π‘₯) is an estimation of the probability of classifying some instance x into class j.
From this point forward, we will discuss purely binary cases, where instances are classified as
either positive or negative. It follows, then, that an instance x will be classified
as positive if and only if

P(0|x) C(1,0) ≤ P(1|x) C(0,1)

Since the two probabilities must sum to one, P(0|x) = 1 − P(1|x), and hence it
is possible to define a threshold parameter p′ which determines that an instance x will be classified
as positive if P(1|x) ≥ p′. Then,
p′ = C(1,0) / (C(1,0) + C(0,1)) = FP / (FP + FN)    (2)
From this equation, if a cost-insensitive classifier can produce a posterior probability P(1|x), it
can be made into a cost-sensitive classifier by calculating the threshold parameter p′ as above, and
classifying any data instance as positive whenever P(1|x) ≥ p′. Not all cost-insensitive classifiers
are capable of producing a posterior probability, though (decision tree algorithms like C4.5, for
example), since they are designed to predict the class directly rather than probabilistically.
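This conversion can be sketched as a thin wrapper around any classifier that outputs P(1|x). The function names and cost values here are hypothetical; this is a sketch of equation (2), not any particular library's API.

```python
def threshold(fp_cost, fn_cost):
    """p' = FP / (FP + FN), as in equation (2)."""
    return fp_cost / (fp_cost + fn_cost)

def cost_sensitive_predict(posterior_positive, fp_cost, fn_cost):
    """Classify as positive (1) whenever P(1|x) >= p', else negative (0)."""
    return 1 if posterior_positive >= threshold(fp_cost, fn_cost) else 0

# With FP = 1 and FN = 5, p' = 1/6, so even fairly weak positive
# evidence is enough to trigger a positive prediction.
print(cost_sensitive_predict(0.2, 1, 5))   # → 1
print(cost_sensitive_predict(0.1, 1, 5))   # → 0
```

Note that the underlying classifier is untouched: only the decision rule applied to its output changes.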
These cost-insensitive classifiers are designed to predict the class with a p′ value of 0.5 – the value
expected if class imbalance were not an issue. By means of sampling, it is possible to alter the
class proportions to produce an effective threshold of p′ as required. Assuming, as before, that
positive examples are in the minority, leave them as they are and take a proportion C(1,0)/C(0,1) of the
negative examples, ignoring the rest. Since the cost of a false positive is (almost always) smaller than
the cost of a false negative, this proportion is less than 1. This method of sampling is known as
under-sampling, and is effectively equivalent to a method called proportional sampling, in which
positive and negative examples are chosen in the following ratio:

p(1|x) C(0,1) : p(0|x) C(1,0)    (3)
Methods of Cost-Sensitive Learning
There are two main categories of cost-sensitive learning: direct cost-sensitive learning and cost-sensitive
meta-learning. In direct cost-sensitive learning the classifiers are cost-sensitive themselves,
whereas in cost-sensitive meta-learning a process is devised that takes an existing cost-insensitive
classifier and makes it cost-sensitive through various methods, usually sampling or thresholding.
These will both be explained in more detail later.
Direct Cost-Sensitive Learning
In direct cost-sensitive learning, the classification algorithm calculates a threshold value 𝑝′ as part of
its functionality, thus addressing the class-imbalance problem and also the problem of differing
misclassification costs in its standard process. Cost-sensitive algorithms aim to introduce these
misclassification costs into the learning process and utilise them to assist in correct classification.
Two well-known algorithms that implement direct cost-sensitive learning are ICET and CSTree. The
former utilises a genetic algorithm to create a population, which it then uses as part of a decision
tree algorithm, using the average classification cost as the fitness function of the genetic algorithm.
The latter uses misclassification cost as part of the process of creating the tree. Instead of
minimising attribute entropy, CSTree selects an attribute based on minimising misclassification cost.
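The flavour of this attribute selection can be illustrated with a rough sketch. This is not the published CSTree algorithm, only the core idea of scoring a candidate split by its misclassification cost rather than its entropy, run on made-up data.

```python
from collections import defaultdict

def split_cost(examples, attr, fp_cost, fn_cost):
    """Total misclassification cost of splitting on `attr`, assuming each
    branch then predicts whichever class is locally cheaper. Labelling a
    branch positive costs FP per negative example in it; labelling it
    negative costs FN per positive example."""
    branches = defaultdict(lambda: [0, 0])   # value -> [n_neg, n_pos]
    for x, y in examples:
        branches[x[attr]][y] += 1
    return sum(min(n_neg * fp_cost,          # predict positive here
                   n_pos * fn_cost)          # predict negative here
               for n_neg, n_pos in branches.values())

def best_attribute(examples, attrs, fp_cost, fn_cost):
    """CSTree-style selection: minimise misclassification cost."""
    return min(attrs, key=lambda a: split_cost(examples, a, fp_cost, fn_cost))

# Attribute 0 separates the classes perfectly; attribute 1 does not.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
print(best_attribute(data, [0, 1], fp_cost=1, fn_cost=5))   # → 0
```

Because the costs enter the splitting criterion itself, expensive errors are avoided while the tree is being grown, not merely when its output is post-processed.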
Cost-Sensitive Meta-Learning
Cost-sensitive meta-learning utilises existing cost-insensitive algorithms without altering them,
instead pre- and post-processing the inputs and outputs to these cost-insensitive algorithms,
ensuring that the system as a whole is cost-sensitive. Cost-sensitive meta-learning algorithms are
generally split into two categories – thresholding and sampling, making use of methods (2) and (3)
respectively.
Provided the cost-insensitive classifier can generate a posterior probability, thresholding uses the
threshold parameter p′ to classify data instances into positive or negative classes. These
thresholding methods rely on the accuracy of the calculated posterior probability estimates, and it
is therefore important that the estimates be as reliable as possible.
MetaCost uses bagged decision trees to obtain accurate estimates of the required posterior
probabilities of the training examples, then uses these estimates to re-label each training
example with its minimum-expected-cost class before training a cost-insensitive classifier on the
re-labelled data. One of the more intuitive approaches to thresholding is
Cost-Sensitive Naïve Bayes: since the Naïve Bayes algorithm already calculates posterior
probabilities, it is a simple matter to apply the threshold parameter to classify data instances.
Sampling pre-processes the training examples in accordance with (3), under-sampling the data set
and then applying cost-insensitive classifiers to the sampled data. Probability estimates are
unnecessary, provided positive and negative examples can be classified with a reasonable degree of
accuracy.
Conclusion
Three differing methods of cost-sensitive learning have been covered – direct cost-sensitive learning,
thresholding and sampling. All three of these methods aim to combat problems in the standard
cost-insensitive algorithms that occur when misclassification costs are not uniform and class
membership is imbalanced.
When faced with a vast majority of negative examples and a minority of positive ones, cost-insensitive
methods tend to predict almost all examples they are presented with as negative. While
this may seem an immediate problem, it has actually been shown that – provided misclassification
costs are uniform and the primary objective for the classifier is to achieve optimal predictive
accuracy – this is the best thing to do.
Therefore, the cost-sensitive methods discussed above only become useful when misclassification
costs differ, and have no place in data sets that are merely class-imbalanced. Investigation
suggests that, for the most part, all three methods perform relatively evenly across a wide variety of
sample data sets. However, for very small data sets over-sampling
performs best, and for large data sets (in the region of over 10,000 examples) the direct cost-sensitive
algorithms emerged as clear winners.
Between the two size extremes, however, the optimal approach varies drastically from data set to
data set. It is of interest to note that 'wrapper'-based approaches to sampling work just as well
as, or even better than, the more thorough and technically correct direct cost-sensitive algorithms,
despite the inherent disadvantages of both over- and under-sampling. The compatibility of sampling
methods with a great many existing cost-insensitive algorithms means they can be much easier to
use in a variety of situations, and raises the question of whether the direct cost-sensitive algorithms are
really necessary if they cannot be made to perform better on small and medium-sized data sets.