The Class Imbalance Problem: A Systematic Study

Nathalie Japkowicz and Shaju Stephen
School of Information Technology and Engineering
University of Ottawa
150 Louis Pasteur, P.O. Box 450 Stn. A
Ottawa, Ontario, Canada, B3H 1W5
Abstract

In machine learning problems, differences in prior class probabilities, or class imbalances, have been reported to hinder the performance of some standard classifiers, such as decision trees. This paper presents a systematic study aimed at answering three different questions. First, we attempt to understand what the class imbalance problem is by establishing a relationship between concept complexity, size of the training set, and class imbalance level. Second, we discuss several basic re-sampling or cost-modifying methods previously proposed to deal with class imbalances and compare their effectiveness. Finally, we investigate the assumption that the class imbalance problem not only affects decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines.

Keywords: concept learning, class imbalances, re-sampling, misclassification costs, C5.0, Multi-Layer Perceptrons, Support Vector Machines
Introduction

As the field of machine learning makes a rapid transition from the status of "academic discipline" to that of "applied science", a myriad of new issues, not previously considered by the machine learning community, is now coming to light. One such issue is the class imbalance problem. The class imbalance problem corresponds to the problem encountered by inductive learning systems on domains for which one class is represented by a large number of examples while the other is represented by only a few.1

The class imbalance problem is of crucial importance since it is encountered by a large number of domains of great environmental, vital or commercial importance, and was shown, in certain cases, to cause a significant bottleneck in the performance attainable by standard learning methods which assume a balanced class distribution. For example, the problem occurs and hinders classification in applications as diverse as the detection of oil spills in satellite radar images (Kubat et al., 98), the detection of fraudulent telephone calls (Fawcett and Provost, 97), in-flight helicopter gearbox fault monitoring (Japkowicz et al., 95), information retrieval and filtering (Lewis and Catlett, 94), and the diagnosis of rare medical conditions such as thyroid diseases (Murphy and Aha, 94).
To this point, there have been a number of attempts at dealing with the class imbalance problem (Pazzani et al., 94; Japkowicz et al., 95; Ling and Li, 98; Kubat and Matwin, 97; Fawcett and Provost, 97; Kubat et al., 98; Domingos, 99; Chawla et al., 01; Elkan, 01); however, these attempts were mostly conducted in isolation. In particular, there has not been, to date, much of a systematic effort to link specific types of imbalances to the degree of inadequacy of standard classifiers, nor have there been many comparisons of the various methods proposed to remedy the problem. Furthermore, no comparison of the performance of different types of classifiers on imbalanced data sets has yet been performed.2
1 In this paper, we only consider the case of concept-learning. However, the discussion also applies to multi-class problems.
2 Two studies attempting to systematize research on the class imbalance problem are worth mentioning, nonetheless. One, currently in progress at AT&T Labs, links different degrees of imbalance to the performance of C4.5, a decision tree learning system, on a large number of real-world data sets. However, it does not study the effect of concept complexity or training set size in the context of their relationship with class imbalances, nor does it look at ways to remedy the class imbalance problem or at the effect of class imbalances on classifiers other than C4.5. The second study is that of [Lawrence et al., 1998], which does not study the effect of class imbalances on classifiers' performance but which compares a number of specific approaches proposed to deal with class imbalances in the context of Neural Networks and on a few real-world data sets. In their study, no classifier other than Neural Networks was considered and no systematic study was conducted.
The purpose of this paper is to address these three concerns in an attempt to unify the research conducted on this problem. In a first part, the paper concentrates on explaining what the class imbalance problem is by establishing a relationship between concept complexity, size of the training set, and class imbalance level. In doing so, we also identify the class imbalance situations that are most damaging for a standard classifier that expects balanced class distributions. The second part of the paper turns to the question of how to deal with the class imbalance problem. In this part we look at five different methods previously proposed to deal with this problem, all assumed to be more or less equivalent to each other. We attempt to establish to what extent these methods are, indeed, equivalent and to what extent they differ. The first two parts of our study were conducted using the C5.0 decision tree induction system. In the third part, we set out to find out whether or not the problems encountered by C5.0 when trained on imbalanced data sets are specific to C5.0. In particular, we attempt to find out whether or not the same pattern of hindrance is encountered by Neural Networks and Support Vector Machines, and whether similar remedies can apply.
The remainder of the paper is divided into six sections. Section 2 is an overview of the paper explaining why the questions we set out to answer are important and how they will advance our understanding of the class imbalance problem. Section 3 describes the part of the study focusing on understanding the nature of the class imbalance problem and finding out what types of class imbalance problems create greater difficulties for a standard classifier. Section 4 describes the part of the study designed to compare the five main types of approaches previously attempted to deal with the class imbalance problem. Section 5 addresses the question of what effect class imbalances have on classifiers other than C5.0. Sections 6 and 7 conclude the paper.
Overview of the Paper
As mentioned in the previous section, the study presented in this paper investigates the following three series of questions:

Question 1: What is the nature of the class imbalance problem? I.e., in what domains do class imbalances most hinder the accuracy performance of a standard classifier such as C5.0?

Question 2: How do the different approaches proposed for dealing with the class imbalance problem compare?

Question 3: Does the class imbalance problem hinder the accuracy performance of classifiers other than C5.0?

These questions are important since their answers may put to rest currently assumed but unproven facts, dispel other unproven beliefs, and suggest fruitful directions for future research. In particular, they may help researchers focus their inquiry on the particular type of solution found most promising, given the particular characteristics identified in their application domain.
Question 1 raises the issue of when class imbalances are damaging. While the studies previously mentioned identified specific domains for which an imbalance was shown to hurt the performance of certain standard classifiers, they did not discuss the questions of whether imbalances are always damaging and to what extent different types of imbalances affect classification performance. This paper takes a global stance and answers these questions in the context of the C5.0 tree induction system on a series of artificial domains spanning a large combination of characteristics.3
Question 2 considers five related approaches previously proposed by independent researchers for tackling the class imbalance problem:4

1. Upsizing the small class at random.
2. Upsizing the small class at "focused" random.
3. Downsizing the large class at random.
4. Downsizing the large class at "focused" random.
5. Altering the relative costs of misclassifying the small and the large classes.
In more detail, Methods 1 and 2 consist of re-sampling patterns of the small class (either completely at random, or randomly but within parts of the input space close to the boundaries with the other class) until there are as many data from the small class as from the large one.5 Methods 3 and 4 consist of eliminating data from the large class (either completely at random, or randomly but within parts of the input space far away from the boundaries with the other class) until there are as many data in both classes.
3 The paper, however, concentrates on domains that present a "between-class imbalance", in that the imbalance affects each subcluster of the small class to the same extent. Because of lack of space, the interesting issue of "within-class imbalances", which are special cases of the problem of small disjuncts (Holte, 89), has been omitted here. This very important question is dealt with elsewhere (Japkowicz, 01).
4 In this study, we focus on discrimination-based approaches to the problem, which base their decisions on both the positive and negative data. The study of recognition-based approaches, which base their decision on one of the two classes but not both, was attempted in (Japkowicz, 00) but did not seem to do as well as discrimination-based methods (this might be linked, however, to the fact that the recognition threshold was not chosen very carefully; nonetheless, we leave it to future work to determine truly whether or not that is the case).
5 (Estabrooks, 00) and the AT&T study previously mentioned in Footnote 2 show that, in fact, the optimal amount of re-sampling is not necessarily that which yields the same number of data in each class. The optimal amount seems to depend upon the input domain and does not seem easy to estimate a priori. In order to simplify our study, we decided here to re-sample until the two classes are of the same size. This decision will not alter our results, however, since we are interested in the relative performance of the different remedial approaches we consider.
Finally, Method 5 consists of reducing the relative misclassification cost of the large class (or, equivalently, increasing that of the small one) to make it correspond to the size of the small class.
These methods were previously proposed by (Ling and Li, 98; Kubat and Matwin, 97; Domingos, 99; Chawla et al., 00; Elkan, 01) but were not systematically compared before. Here, we compare the five methods, once again, on the data sets used in the previous part of the paper. This was done to see whether or not the five approaches for dealing with class imbalances respond to different domain characteristics in the same way.
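As a minimal illustration of the two simplest strategies above (random upsizing and downsizing), the Python sketch below, which is our own and not part of the original study, balances a two-class data set by either randomly duplicating members of the small class or randomly discarding members of the large class until the two classes have the same size; the focused variants and the cost-modifying method are detailed in Section 4.

```python
import random

def balance_by_resampling(pos, neg, method="oversample", seed=0):
    """Return class-balanced copies of the positive and negative example lists.

    method="oversample": duplicate random members of the smaller class (Method 1).
    method="undersample": discard random members of the larger class (Method 3).
    """
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    if method == "oversample":
        small = pos if len(pos) <= len(neg) else neg
        target = max(len(pos), len(neg))
        while len(small) < target:
            small.append(rng.choice(small))   # sample existing examples with replacement
    elif method == "undersample":
        large = pos if len(pos) > len(neg) else neg
        target = min(len(pos), len(neg))
        rng.shuffle(large)
        del large[target:]                    # drop the excess at random
    else:
        raise ValueError("unknown method: " + method)
    return pos, neg
```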
Question 3, finally, asks whether the observations made in answering the previous questions for C5.0 also hold for other classifiers. In particular, we study the effect of class imbalances on Multi-Layer Perceptrons (MLPs), which could be thought of as being capable of more flexible learning than C5.0 and, thus, of being less sensitive to class imbalances. We then repeat this study with Support Vector Machines (SVMs), which could be believed not to be affected by this problem, given that they base their classification on a small number of support vectors and, thus, may not be sensitive to the number of data representing each class. We look at the performance of MLPs and SVMs on a subset of the series of domains used in the previous part of the paper so as to see whether the three approaches are affected by different domain characteristics in the same ways.
Question 1: What is the Nature of the Class Imbalance Problem?
In order to answer Question 1, a series of artificial concept-learning domains was generated that varies along three different dimensions: the degree of concept complexity, the size of the training set, and the level of imbalance between the two classes. The standard classifier system tested on these domains in this section was the C5.0 decision tree induction system (Quinlan, 93). This classifier has previously been shown to suffer from the class imbalance problem (e.g., (Kubat et al., 98)), but not in a completely systematic fashion. The study in this section aims at answering the question of what different faces a class imbalance can take and which of these faces hinders C5.0 most.

This part of the paper first discusses the domain generation process, followed by a report of the results obtained by C5.0 on the various domains.
Domain Generation
For the experiments of this section, 125 domains were created with various combinations of concept complexity, training set size, and degree of imbalance. The generation method used was inspired by Schaffer, who designed a similar framework for testing the effect of overfitting avoidance in sparse data sets (Schaffer, 93). From Schaffer's study, it was clear that the complexity of the concept at hand was an important part of the data overfitting problem and, given the relationship between the problem of overfitting the data and dealing with class imbalances (see (Kubat et al., 98)), it seems reasonable to assume that, here again, concept complexity is an important piece of the puzzle. Similarly, the training set size should also be a factor in a classifier's ability to deal with imbalanced domains, given the relationship between the data overfitting problem and the size of the training set. Finally, the degree of imbalance is the obvious other parameter expected to influence a classifier's ability to classify imbalanced domains.

The 125 generated domains of our study were generated in the following way: each of the domains is one-dimensional, with inputs in the [0, 1] range associated with one of the two classes (1 or 0).
Figure 1: A Backbone Model of Complexity 3 (c = 3; + = class 1, - = class 0; the two classes alternate over eight regular intervals with boundaries at .125, .25, .375, .5, .625, .75 and .875).
The input range is divided into a number of regular intervals (i.e., intervals of the same size), each associated with a different class value. Contiguous intervals have opposite class values, and the degree of concept complexity corresponds to the number of alternating intervals present in the domain. Actual training sets are generated from these backbone models by sampling points at random (using a uniform distribution) from each of the intervals. The number of points sampled from each interval depends on the size of the domain as well as on its degree of imbalance. An example of a backbone model is shown in Figure 1.

Five different complexity levels were considered (c = 1..5), where each level, c, corresponds to a backbone model composed of 2^c regular intervals. For example, the domains generated at complexity level c = 1 are such that every point whose input is in the range [0, .5) is associated with a class value of 1, while every point whose input is in the range (.5, 1] is associated with a class value of 0; at complexity level c = 2, points in intervals [0, .25) and (.5, .75) are associated with class value 1 while those in intervals (.25, .5) and (.75, 1] are associated with class value 0; etc., regardless of the size of the training set and its degree of imbalance.6
6 In this paper, complexity is varied along a single very simple dimension. Other more sophisticated models could be used in order to obtain finer-grained results. In (Estabrooks, 00), for example, a k-DNF model using several dimensions was used to generate a few artificial domains presenting class imbalances. The study was less systematic than the one in this paper, but it yielded results corroborating those of this paper.
Five training set sizes were considered (s = 1..5), where each size, s, corresponds to a training set of size round((5000/32) * 2^s). Since this training set size includes all the regular intervals in the domain, each regular interval is, in fact, represented by round(((5000/32) * 2^s) / 2^c) training points (before the imbalance factor is considered). For example, at a size level of s = 1 and at a complexity level of c = 1, and before any imbalance is taken into consideration, intervals [0, .5) and (.5, 1] are each represented by 157 examples; if the size is the same but the complexity level is c = 2, then each of the intervals [0, .25), (.25, .5), (.5, .75) and (.75, 1] contains 78 training examples; etc.

Finally, five levels of class imbalance were also considered (i = 1..5), where each level, i, corresponds to the situation where each sub-interval of class 1 is represented by all the data it is normally entitled to (given c and s), but each sub-interval of class 0 contains only 1/(32/2^i)th (rounded) of all its normally entitled data. This means that each of the sub-intervals of class 0 is represented by round((((5000/32) * 2^s) / 2^c) / (32/2^i)) training examples. For example, for c = 1, s = 1, and i = 2, interval [0, .5) is represented by 157 examples and (.5, 1] is represented by 79; if c = 2, s = 1 and i = 3, then [0, .25) and (.5, .75) are each represented by 78 examples while (.25, .5) and (.75, 1] are each represented by 20; etc.

The number of testing points representing each sub-interval was kept fixed (at 50). This means that all domains of complexity level c = 1 are tested on 50 positive and 50 negative examples; all domains of complexity level c = 2 are tested on 100 positive and 100 negative examples; etc.
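To make the construction concrete, the following Python sketch (our own illustration, not code from the study) generates one training set for a given complexity c, size level s and imbalance level i, taking the formulas above at face value; the exact rounding convention is an assumption.

```python
import random

def generate_domain(c, s, i, seed=0):
    """Generate a one-dimensional training set from a backbone model.

    c -- concept complexity: the backbone has 2**c alternating intervals
    s -- size level: full training set size is about (5000/32) * 2**s
    i -- imbalance level: class-0 intervals keep only 1/(32/2**i) of their points
    Returns a list of (x, label) pairs.
    """
    rng = random.Random(seed)
    n_intervals = 2 ** c
    width = 1.0 / n_intervals
    per_interval = round(((5000.0 / 32) * 2 ** s) / n_intervals)
    class0_count = round(per_interval / (32.0 / 2 ** i))
    data = []
    for k in range(n_intervals):
        label = 1 if k % 2 == 0 else 0      # intervals alternate, starting with class 1
        count = per_interval if label == 1 else class0_count
        lo, hi = k * width, (k + 1) * width
        for _ in range(count):
            data.append((rng.uniform(lo, hi), label))
    return data

# Example: complexity 2, size level 1, imbalance level 3 gives roughly
# 78 points per class-1 interval and 20 per class-0 interval.
train = generate_domain(c=2, s=1, i=3)
```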
Results for Question 1
The results for C5.0 are displayed in Figures 2, 3, 4 and 5, which plot the error C5.0 obtained for each combination of concept complexity, training set size and imbalance level, on the entire testing set. For each experiment, we report four types of results: 1) the corrected results, in which, no matter what degree of class imbalance is present in the training set, the contribution of the false positive error rate is the same as that of the false negative one in the overall report;7 2) the uncorrected results, in which the reported error rate reflects the same imbalance as the one present in the training set;8 3) the false positive error rate; and 4) the false negative error rate. The corrected and uncorrected results are provided so as to take into consideration two out of any possible number of situations: one in which, despite the presence of an imbalance, the cost of misclassifying the data of one class is the same as that of misclassifying those of the other class (the corrected version); the other situation is the one where the relative cost of misclassifying the two classes corresponds to the class imbalance.9

Each plot in each of these figures represents the plot obtained at a different training set size. The leftmost plot corresponds to the smallest size (s = 1) and progresses until the rightmost plot, which corresponds to the largest (s = 5). Within each of these plots, each cluster of five bars represents the concept complexity level.
7 For this set of results, we simply report the error rate obtained on the testing set corresponding to the experiment at hand.
8 For this set of results, we modify the ratio of false positive to false negative errors obtained on the original testing set to make it correspond to the ratio of positive to negative examples in the training set.
9 A more complete set of results could have involved comparisons at other relative costs as well. However, given our large number of experiments, this would have been unmanageable. We thus decided to focus on two meaningful and important cases only. Similarly, and for the same reasons, we decided not to vary C5.0's decision threshold across the ROC space (Swets et al., 2000). Since we are seeking to establish the relative performance of several classification approaches, we believe that all the results obtained using the same decision threshold are representative of what would have happened along the ROC curves. We leave it to future work, however, to verify this assumption.
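The sketch below (Python; an illustration of our reading of Footnotes 7 and 8, not code from the study) shows one way the corrected and uncorrected error rates could be computed from the false positive and false negative counts on the balanced testing set.

```python
def error_rates(fp, fn, n_pos_test, n_neg_test, n_pos_train, n_neg_train):
    """Corrected and uncorrected error rates from test-set error counts.

    fp, fn -- false positives and false negatives on the testing set
    n_pos_test, n_neg_test -- testing examples per class (equal in this study)
    n_pos_train, n_neg_train -- training examples per class (possibly imbalanced)
    """
    # Corrected: plain error rate on the balanced testing set, so false
    # positives and false negatives contribute equally (Footnote 7).
    corrected = (fp + fn) / float(n_pos_test + n_neg_test)

    # Uncorrected: re-weight the two error types so that they reflect the
    # class proportions seen in the training set (our reading of Footnote 8).
    p_pos = n_pos_train / float(n_pos_train + n_neg_train)
    p_neg = 1.0 - p_pos
    uncorrected = p_pos * (fn / float(n_pos_test)) + p_neg * (fp / float(n_neg_test))
    return corrected, uncorrected

# Example: 10 false positives and 2 false negatives on a 50 + 50 testing set,
# with a 157:20 training imbalance in favour of the positive class.
print(error_rates(fp=10, fn=2, n_pos_test=50, n_neg_test=50,
                  n_pos_train=157, n_neg_train=20))
```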
Figure 2: C5.0 and the Class Imbalance Problem -- Corrected (panels (a) to (e): training set sizes 1 to 5).

Figure 3: C5.0 and the Class Imbalance Problem -- Uncorrected (panels (a) to (e): training set sizes 1 to 5).

Figure 4: C5.0 and the Class Imbalance Problem -- False Positive Error Rate (panels (a) to (e): training set sizes 1 to 5).

Figure 5: C5.0 and the Class Imbalance Problem -- False Negative Error Rate: very close to 0 (panels (a) to (e): training set sizes 1 to 5).
The leftmost cluster corresponds to the simplest concept (c = 1) and progresses until the rightmost one, which corresponds to the most complex (c = 5). Within each cluster, finally, each bar corresponds to a particular imbalance level. The leftmost bar corresponds to the most imbalanced level (i = 1) and progresses until the rightmost bar, which corresponds to the most balanced level (i = 5, or no imbalance). The height of each bar represents the average percent error rate obtained by C5.0 (over five runs on different domains generated from the same backbone model) for the complexity, class size and imbalance level this bar represents. To make the comparisons easy, horizontal bars were drawn at every 5% mark. If a graph does not display any horizontal bars, it is because all the bars represent an average percent error below 5%, and we consider the error negligible in such cases.
Our results reveal several points of interest. First, no matter what the size of the training set is, linearly separable domains (domains of complexity level c = 1) do not appear sensitive to any amount of imbalance. As a matter of fact, as the degree of concept complexity increases, so does the system's sensitivity to imbalances. Indeed, we can clearly see both in Figure 2 (the corrected results) and Figure 3 (the uncorrected results) that as the degree of complexity increases, high error rates are caused by lower and lower degrees of imbalance. Although the error rates reported in the corrected cases are higher than those reported in the uncorrected cases, the effect of concept complexity on class imbalances is clearly visible in both situations.

A look at Figures 4 and 5 explains the difference between Figures 2 and 3, since it reveals that most of the error represented in these graphs actually occurs on the negative testing set (i.e., most of the errors are false positive errors). Indeed, none of the average percentages of false negative errors over all degrees of concept complexity and levels of imbalance ever exceed 5%. This is not surprising since we had expected the classifier to overfit the majority class, but the extent to which it does so might be a bit surprising.

As could be expected, imbalance rates are also a factor in the performance of C5.0 and, perhaps more surprisingly, so is the training set size. Indeed, as the size of the training set increases, the degree of imbalance yielding a large error rate decreases. This suggests that in very large domains, the class imbalance problem may not be a hindrance to a classification system. Specifically, the issue of the relative cardinality of the two classes, which is often assumed to be the problem underlying domains with class imbalances, may in fact be easily overridden by the use of a large enough data set (if, of course, such a data set is available and its size does not prevent the classifier from learning the domain in an acceptable time frame).

All in all, our study suggests that the imbalance problem is a relative problem depending on both the complexity of the concept represented by the data in which the imbalance occurs and the overall size of the training set, in addition to the degree of class imbalance present in the data. In other words, a huge class imbalance will not hinder classification of a domain whose concept is very easy to learn, nor will we see a problem if the training set is very large. Conversely, a small class imbalance can greatly harm a very small data set or one representing a very complex concept.
Question 2: A Comparison of Various Strategies
Having identified the domains for which a class imbalance does impair the accuracy of a regular classifier such as C5.0, this section now proposes to compare the main methodologies that have been proposed to deal with this problem. First, the various schemes used for this comparison are described, followed by a comparative report on their performance. In all the experiments of this section, once again, C5.0 is used as our standard classifier.
Schemes for Dealing with Class Imbalances
Over-Sampling

Two over-sampling methods were considered in this category. The first one, random oversampling, consists of over-sampling the small class at random until it contains as many examples as the other class. The second method, focused oversampling, consists of over-sampling the small class only with data occurring close to the boundaries between the concept and its negation. A factor of .25 was chosen to represent closeness to the boundaries.10
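The Python sketch below is an illustrative reading of this scheme and of Footnote 10 (it is not the study's code, and the function names and 500-trial fallback behaviour are taken at face value from the footnote): existing small-class examples are duplicated, preferring those that fall within the boundary-close quarter of their backbone interval.

```python
import random

def close_to_boundary(x, intervals, factor=0.25):
    """True if x lies within `factor` of either end of its backbone interval."""
    for a, b in intervals:
        if a <= x <= b:
            return x <= a + factor * (b - a) or x >= b - factor * (b - a)
    return False

def focused_oversample(small, intervals, target_size, max_trials=500, seed=0):
    """Duplicate small-class examples lying close to the class boundaries.

    small       -- x values of the small class
    intervals   -- (a, b) backbone intervals belonging to the small class
    target_size -- desired size after over-sampling (the size of the large class)
    Falls back to fully random duplication if no boundary-close example is
    drawn within `max_trials` attempts (cf. Footnote 10).
    """
    rng = random.Random(seed)
    out = list(small)
    while len(out) < target_size:
        candidate = None
        for _ in range(max_trials):
            x = rng.choice(small)
            if close_to_boundary(x, intervals):
                candidate = x
                break
        out.append(candidate if candidate is not None else rng.choice(small))
    return out
```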
Under-Sampling

Two under-sampling methods, closely related to the over-sampling methods, were considered in this category. The first one, random undersampling, consists of eliminating, at random, elements of the over-sized class until it matches the size of the other class. The second one, focused undersampling, consists of eliminating only elements lying further away from the boundaries (where, again, the factor of .25 defines closeness to the boundaries).
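A matching sketch for focused under-sampling, again only an illustrative reading with hypothetical function names: elements of the large class lying far from the boundaries are removed first, and near-boundary elements are touched only if that is not enough.

```python
import random

def focused_undersample(large, intervals, target_size, factor=0.25, seed=0):
    """Randomly remove large-class examples, far-from-boundary ones first.

    large       -- x values of the large class
    intervals   -- (a, b) backbone intervals belonging to the large class
    target_size -- desired size after under-sampling (size of the small class)
    """
    def near_boundary(x):
        for a, b in intervals:
            if a <= x <= b:
                return x <= a + factor * (b - a) or x >= b - factor * (b - a)
        return False

    rng = random.Random(seed)
    far = [x for x in large if not near_boundary(x)]
    near = [x for x in large if near_boundary(x)]
    rng.shuffle(far)
    rng.shuffle(near)
    to_remove = max(0, len(large) - target_size)
    # Eliminate far-from-boundary elements first; only remove near-boundary
    # elements if removing all far ones is not enough.
    kept = far[min(to_remove, len(far)):] + near[max(0, to_remove - len(far)):]
    return kept
```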
Cost-Modifying

The cost-modifying method used in this study consists of modifying the relative cost associated with misclassifying the positive and the negative class so that it compensates for the imbalance ratio of the two classes. For example, if the data presents a 1:10 class imbalance in favour of the negative class, the cost of misclassifying a positive example will be set to 9 times that of misclassifying a negative one.
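In the study this ratio is applied through C5.0's misclassification-cost mechanism. Purely as an illustration, the sketch below derives the cost ratio from the class counts and expresses it as per-example weights, one generic way such a cost could be passed to a learner that accepts sample weights; this weighting scheme is our assumption, not a description of how C5.0 handles costs internally.

```python
def misclassification_weights(labels):
    """Per-example weights implementing a cost ratio that compensates for imbalance.

    Errors on the small (positive) class are made more expensive than errors on
    the large (negative) class, in proportion to the class ratio.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    cost_pos = n_neg / float(n_pos)   # cost of a positive error relative to a negative one
    return [cost_pos if y == 1 else 1.0 for y in labels]

# Example: with 10 positives and 90 negatives, each positive error is
# weighted 9 times as heavily as each negative error.
weights = misclassification_weights([1] * 10 + [0] * 90)
```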
10 This factor means that for interval [a, b], data considered close to the boundary are those in [a, a + .25(b-a)] and [a + .75(b-a), b]. If no data were found in these intervals (after 500 random trials were attempted), then the data were sampled from the full interval [a, b], as in the random oversampling methodology.
Figure 6: Oversampling: Error Rate, Corrected (panels (a) to (e): training set sizes 1 to 5).

Figure 7: Oversampling: Error Rate, Uncorrected (panels (a) to (e): training set sizes 1 to 5).
Results for Question 2
As in the previous section, four series of results are reported for each scheme: the corrected error, the uncorrected error, the false positive error and the false negative error. The format of the results is the same as that used in the last section. The results for random oversampling are displayed in Figures 6 to 9; those for focused oversampling, in Figures 10 to 13; those for random undersampling, in Figures 14 to 17; those for focused undersampling, in Figures 18 to 21; and those for cost-modifying, in Figures 22 to 25.
Figure 8: Oversampling: False Positive Error Rate (panels (a) to (e): training set sizes 1 to 5).

Figure 9: Oversampling: False Negative Error Rate.

Figure 10: Focused Oversampling: Error Rates, Corrected.
The results indicate a number of interesting points. First, all the methods proposed to deal with the class imbalance problem present an improvement over C5.0 used without any re-sampling or cost-modifying technique, both in the corrected and the uncorrected versions of the results. Nonetheless, not all methods help to the same extent. In particular, of all the methods suggested, undersampling is by far the least effective. This result is actually at odds with previously reported results (e.g., (Domingos, 99)), but we explain this disparity by the fact that in the applications considered by (Domingos, 99), the minority class is the class of interest while the majority class represents everything other than the examples of interest.
Figure 11: Focused Oversampling: Error Rates, Uncorrected (each of Figures 11 to 25: panels (a) to (e), training set sizes 1 to 5).

Figure 12: Focused Oversampling: Error Rates, False Positives.

Figure 13: Focused Oversampling: Error Rates, False Negatives.

Figure 14: Undersampling: Corrected Error Rate.

Figure 15: Undersampling: Uncorrected Error Rate.

Figure 16: Undersampling: False Positive Error Rate.

Figure 17: Undersampling: False Negative Error Rate.

Figure 18: Focused Undersampling: Error Rate, Corrected.

Figure 19: Focused Undersampling: Error Rate, Uncorrected.

Figure 20: Focused Undersampling: False Positive Error Rate.

Figure 21: Focused Undersampling: False Negative Error Rate.

Figure 22: Cost Modifying: Corrected Error Rate.

Figure 23: Cost Modifying: Uncorrected Error Rate.

Figure 24: Cost Modifying: False Positive Error Rate.

Figure 25: Cost Modifying: False Negative Error Rate.
It follows that in domains such as those considered in (Domingos, 99), the majority class includes a lot of data irrelevant to the classification task at hand that are worth eliminating with undersampling techniques. In our data sets, on the other hand, the roles of the positive and the negative class are perfectly symmetrical and no examples are irrelevant. Undersampling is, thus, not a very useful scheme in these domains. Focused undersampling does not present any advantages over random undersampling on our data sets either, and neither method is recommended in those cases where the two classes play symmetrical roles and do not contain irrelevant data.
The situation, in the case of oversampling, is quite different. Indeed, oversampling is shown to help quite dramatically at all complexity and training set size levels. Just to illustrate this fact, consider, for example, the situation at size 2 and degree of complexity 4: while in this case any degree of imbalance (other than the case where no imbalance is present) causes C5.0 difficulties (see Figures 2(b) and 3(b)), none but the highest degree of imbalance does so when the data is oversampled at random (see Figures 6(b) and 7(b)). Contrary to the case of undersampling, the focused approach does make a difference, albeit a small one, in the case of oversampling. Indeed, at sizes 1 and 2, focused oversampling deals with the highest level of complexity better than random oversampling (compare the results at degree of difficulty 5 in Figures 6(a, b) and 7(a, b) on the one hand and Figures 10(a, b) and 11(a, b) on the other hand). Interestingly, the improvement in overall error does not seem to affect the distribution of the error. Indeed, as Figures 8, 9, 12 and 13 attest, while the false positive rate has decreased, the false negative one has not significantly increased, despite the fact that the size of the positive training set has increased dramatically. This is quite an important result since it contradicts the expectation that oversampling would have shifted the error distribution and, thus, not helped much in the case where it is essential to preserve a low false negative error rate while reducing the false positive error rate. In summary, oversampling and focused oversampling seem quite effective ways of dealing with the problem, at least in situations such as those represented in our training sets.

The last method, cost-modifying, is more effective than both random oversampling and focused oversampling in all but a single observed case, that of concept complexity 5 and size 3 (compare the results for concept complexity 5 in Figures 6(c), 7(c), 10(c) and 11(c) on the one hand to those of Figures 22(c) and 23(c) on the other). In this case both random and focused oversampling are more accurate than cost-modifying. The generally better results obtained with the cost-modifying method over those obtained by oversampling are in agreement with (Lawrence et al., 98), who suggest that modifying the relative cost of misclassifying each class allows one to achieve the same goals as oversampling without increasing the training set size, a step that can harm the performance of a classifier. Nonetheless, although we did not show it here, we assume that in those cases where the majority class contains irrelevant examples, undersampling methods may be more effective than cost-modifying ones.
Question 3: Are Other Classifiers Also Sensitive to Class Imbalances in the Data?
The previous two sections studied the question of how class imbalances affect classification and how they can be countered, all in the context of C5.0, a decision tree induction system. In this section, we are concerned with whether classification systems using other learning paradigms are also affected by the class imbalance problem, and to what extent. In particular, we consider two other paradigms which, a priori, may seem less prone to hindrances in the face of class imbalances than decision trees: Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs).

MLPs can be believed to be less prone to the class imbalance problem because of their flexibility. Indeed, they may be thought to be able to compute a less global partition of the space than decision tree learning systems, since they get modified by each data point sequentially and repeatedly and thus follow a top-down as well as a bottom-up search of the hypothesis space simultaneously. Even more convincingly than MLPs, SVMs can be believed to be less prone to the class imbalance problem than C5.0 because boundaries between classes are calculated with respect to only a few support vectors, the data points located close to the other class. The size of the data set representing each class may, thus, be believed not to matter given such an approach to classification.

The point of this section is to assess whether MLPs and SVMs are indeed less prone to the class imbalance problem and, if so, to what extent. Again, we used domains belonging to the same family as the ones used in the previous sections to make this assessment. Nonetheless, because MLP and SVM training is much less time-efficient than C5.0 training, and because SVM training was not even possible for large domains on our machine (because of a lack of memory), we did not conduct as extensive a set of experiments as we did in the previous sections. In particular, because of memory restrictions, we restricted our study of the effects of class imbalances to domains of size 1 for SVMs (for MLPs, we actually conducted the imbalance study on all sizes since we did not have memory problems) and, because of low training efficiency, we only looked at the effect of random oversampling and undersampling for size 1 on both classifiers.11
MLPs and the Class Imbalance Problem
Because of the nature of MLPs, more experiments needed to be run than in the case of C5.0. Indeed, because the performance of an MLP depends upon the number of hidden units it uses, we experimented with 2, 4, 8 and 16 hidden units and reported only the results obtained with the optimal network capacity. Other default values were kept fixed (i.e., all the networks were trained by the Levenberg-Marquardt optimization method; the learning rate was set at 0.01; the networks were all trained for a maximum of 300 epochs or until the performance gradient descended below 10^-10; and the threshold for discrimination between the two classes was set at 0.5). This means that the results are reported a posteriori (after checking all the possible network capacities, the best results are reported).
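As a rough illustration of this capacity sweep with a-posteriori selection, the sketch below is not the setup used in the study: the experiments above used Levenberg-Marquardt training, which scikit-learn's MLPClassifier does not provide, so the optimizer, the toy data and the accuracy-based selection are all stand-in assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def best_mlp(train_x, train_y, test_x, test_y, hidden_sizes=(2, 4, 8, 16)):
    """Train one MLP per candidate capacity and keep the best test result.

    train_x, test_x -- arrays of shape (n, 1): the one-dimensional inputs
    train_y, test_y -- arrays of 0/1 class labels
    """
    best = None
    for h in hidden_sizes:
        net = MLPClassifier(hidden_layer_sizes=(h,), learning_rate_init=0.01,
                            max_iter=300, random_state=0)
        net.fit(train_x, train_y)
        acc = net.score(test_x, test_y)     # a-posteriori selection
        if best is None or acc > best[0]:
            best = (acc, h, net)
    return best  # (accuracy, number of hidden units, fitted network)

# Toy imbalanced 1-D domain (class 0 under-represented); for brevity the same
# data is used for training and selection here.
rng = np.random.RandomState(0)
x = np.concatenate([rng.uniform(0.0, 0.5, 150), rng.uniform(0.5, 1.0, 20)]).reshape(-1, 1)
y = np.concatenate([np.ones(150, dtype=int), np.zeros(20, dtype=int)])
print(best_mlp(x, y, x, y)[:2])
```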
11 Unlike in Question 2 for C5.0, our intent here is not to compare all possible techniques for dealing with the class imbalance problem with MLPs and SVMs. Instead, we are only hoping to shed some light on whether these two systems do suffer from class imbalances and to get an idea of whether some simple remedial methods can be considered for dealing with the problem.
Figure 26: MLPs and the Class Imbalance Problem -- Corrected (panels (a) to (e): training set sizes 1 to 5).

Figure 27: MLPs and the Class Imbalance Problem -- Uncorrected (panels (a) to (e): training set sizes 1 to 5).
The results are presented in Figures 26, 27, 28 and 29 for concept complexities c = 1..5, training set sizes s = 1..5, and imbalance levels i = 1..5. The format used to report these results is the same as the one used in the previous two sections.

There are several important differences between the results obtained with C5.0 and those obtained with MLPs. In particular, in all the MLP graphs a large amount of variance can be noticed in the results, despite the fact that all results were averaged over five different trials. The conclusions derived from these graphs should thus be thought of as reflecting general trends rather than specific results. Furthermore, a careful analysis of the graphs reveals that MLPs do not seem to suffer from the class imbalance problem in the same way as C5.0.
Figure 28: Lessening the Class Imbalance Problem in MLP Networks -- Corrected (panels: (a) Imbalanced, (b) Oversampling, (c) Undersampling).

Figure 29: Lessening the Class Imbalance Problem in MLP Networks -- Uncorrected (panels: (a) Imbalanced, (b) Oversampling, (c) Undersampling).
Looking, for example, at the graphs for size 1 for C5.0 and MLPs (see Figures 2(a) and 3(a) on the one hand and Figures 26(a) and 27(a) on the other hand), we see that C5.0 displays extreme behaviours: it either does a perfect (or close to perfect) job or it misclassifies 25% of the testing set (see Figure 2(a)). For MLPs, this is not the case, and misclassification rates span an entire range. As a result, MLPs seem less affected by the class imbalance problem than C5.0. For example, for size 1 and concept complexity 4, C5.0 run with imbalance levels 4, 3, 2 and 1 (see Figure 2(a)) misclassifies 25% of the testing set, whereas the MLP (see Figure 26(a)) misclassifies the full 25% of the testing set for only imbalance levels 2 and 1, the highest degrees of imbalance (some misclassification also occurs at imbalance levels 3 and 4, but not as drastic as for levels 2 and 1). Note that the difficulty displayed by MLPs at concept complexity 5 for all sizes is probably caused by the fact that, once again for efficiency reasons, we did not try networks of capacity greater than 16 hidden units. We, thus, ignore these results in our discussion.

Another important difference, which can be seen by looking at the graphs for size 5 of both C5.0 (Figures 2(e) and 3(e)) and MLPs (Figures 26(e) and 27(e)), is that while the overall size of the training set makes a big difference in the case of C5.0, it does not make any difference for MLPs: except for the highest imbalance levels combined with the highest degrees of complexity, C5.0 does not display any noticeable error at training set size 5, the largest. MLPs, on the other hand, do. This may be explained by the fact that it is more difficult for MLP networks to process large quantities of data than it is for C5.0.

Because MLPs generally suffer from the class imbalance problem, we asked whether, as for C5.0, this problem can be lessened by simple techniques. For the efficiency reasons noted earlier and for conciseness of the report, we restricted our experiments to the cases of random oversampling and random undersampling and to the smallest size (size 1). The results of these experiments are shown in Figures 28 and 29, which display the results obtained with no re-sampling at all (a repeat of Figures 26(a) and 27(a)), random oversampling and random undersampling. Only the corrected and uncorrected results are reported.

The results in these figures show that both oversampling and undersampling have a noticeable effect for MLPs, though, once again, oversampling seems more effective. The difference in effectiveness between undersampling and oversampling, however, is less pronounced in the case of MLPs than it was in the case of C5.0. As a matter of fact, undersampling is much less effective than oversampling for MLPs in the most imbalanced cases, but it has comparable effectiveness in all the other ones. This suggests that, as for C5.0, simple methods for counteracting the effect of class imbalances should be considered when using MLPs.
SVMs and the Class Imbalance Problem
As for MLPs, more experiments needed to be run with SVMs than in the case of C5.0. Actually, even more experiments were run with SVMs than with MLPs. We ran SVMs with a Gaussian kernel but, since the optimal variance of this kernel is unknown, we tried 10 different possible variance values for each experiment.
Figure 30: The Class Imbalance Problem in SVMs -- Corrected (panels: (a) Imbalanced, (b) Oversampling, (c) Undersampling).

Figure 31: The Class Imbalance Problem in SVMs -- Uncorrected (panels: (a) Imbalanced, (b) Oversampling, (c) Undersampling).
We experimented with variances 0.1, 0.2, etc., up to 1. We did not experiment with modifications to the soft margin threshold (note that such experiments would be equivalent to the cost-modification experiments of C5.0). As for MLPs, the results are reported a posteriori (after checking the results with all the possible variances, we report only the best results obtained).
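A rough illustration of this variance sweep with a-posteriori selection is sketched below (Python with scikit-learn, which is not what the study used; mapping the Gaussian kernel variance sigma^2 to the RBF parameter gamma = 1/(2*sigma^2) is our assumption, as are the toy data and the accuracy-based selection).

```python
import numpy as np
from sklearn.svm import SVC

def best_gaussian_svm(train_x, train_y, test_x, test_y,
                      variances=[v / 10.0 for v in range(1, 11)]):
    """Train one RBF-kernel SVM per candidate variance and keep the best result."""
    best = None
    for var in variances:
        clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * var))   # assumed variance-to-gamma mapping
        clf.fit(train_x, train_y)
        acc = clf.score(test_x, test_y)                     # a-posteriori selection
        if best is None or acc > best[0]:
            best = (acc, var, clf)
    return best  # (accuracy, kernel variance, fitted SVM)

# Toy imbalanced 1-D example, as in the MLP sketch above.
rng = np.random.RandomState(0)
x = np.concatenate([rng.uniform(0.0, 0.5, 150), rng.uniform(0.5, 1.0, 20)]).reshape(-1, 1)
y = np.concatenate([np.ones(150, dtype=int), np.zeros(20, dtype=int)])
print(best_gaussian_svm(x, y, x, y)[:2])
```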
As mentioned before, because of problems of memory capacity, the results are reported for a training set of size 1, and it was not possible to report results similar to those reported for MLPs in Figures 26 and 27. Instead, we report results similar to those of Figures 28 and 29 for MLPs in Figures 30 and 31 for SVMs. In particular, these figures show the results obtained by SVMs with no resampling at all, random oversampling and random undersampling for size 1.
The results displayed in Figures 30(a) and 31(a) show that there is a big difference between C5.0 and MLPs on the one hand and SVMs on the other. Indeed, while for both C5.0 and MLPs the leftmost column in a cluster of columns, the one representing the highest degree of imbalance, was higher than the others (Figures 2(a) and 26(a)), in the case of SVMs (Figure 30(a)) the five columns in the clusters display a flat value, or the leftmost columns have lower values than the rightmost ones (see the case of concept complexity 4 in particular). The uncorrected results in Figure 31(a) reflect the fact that SVMs are completely insensitive to class imbalances and make, on average, as many errors on the positive and the negative testing set, and are shown to suffer if the relative cost of misclassifying the two classes is altered in favour of one or the other class.

This is quite interesting since it suggests that SVMs are absolutely not sensitive to the class imbalance problem (this, by the way, is similar to the property of the decision tree splitting criterion introduced by (Drummond and Holte, 00)). As a matter of fact, Figures 30 and 31 (b) and (c) show that oversampling the data at random does not help in any way and undersampling it at random even hurts the SVM's performance.

All in all, this suggests that when confronted with a class imbalance situation, it might be wise to consider using SVMs since they are robust to such problems. Of course, this should be done only if SVMs fare well on the particular problem at hand as compared to other classifiers. In our domains, for example, up to concept complexity 3 (inclusive), SVMs (Figure 30(a)) are competitive with MLPs (Figure 28(a)) and only slightly less competitive than oversampled C5.0 (Figure 6(a)). At complexity 4, oversampled MLPs (Figure 28(b)) and C5.0 (Figure 6(a)) are more accurate than SVMs (Figure 30(a)).
Conclusion
The purpose of this paper was to explain the nature of the class imbalance problem, compare various simple strategies previously proposed to deal with the problem, and assess the effect of class imbalances on different types of classifiers.

Our experiments allowed us to conclude that the class imbalance problem is a relative problem that depends on 1) the degree of class imbalance; 2) the complexity of the concept represented by the data; 3) the overall size of the training set; and 4) the classifier involved. More specifically, we found that the higher the degree of class imbalance, the higher the complexity of the concept, and the smaller the overall size of the training set, the greater the effect of class imbalances on classifiers sensitive to the problem. The three types of classifiers we tested were not sensitive to the class imbalance problem in the same way: C5.0 was the most sensitive of the three; MLPs came next and displayed a different pattern of sensitivity (a more graded one compared to C5.0's, which was more categorical); and SVMs came last since they were shown not to be sensitive to this problem at all.

Finally, for classifiers sensitive to the class imbalance problem, it was shown that simple re-sampling methods could help a great deal, whereas they do not help, and in certain cases even hurt, the classifier insensitive to class imbalances. An extensive and careful study of the classifier most affected by class imbalances, C5.0, reveals that while random re-sampling is an effective way to deal with the problem, random oversampling is a lot more useful than random undersampling. More "intelligent" oversampling helps even further, but more "intelligent" undersampling does not. The cost-modifying method seems more appropriate than the over-sampling and even the focused over-sampling method, except in one case of very high complexity and medium-range training set size.
Future Work
The work in this paper presents a systematic study of class imbalance problems on a large family of domains. Nonetheless, this family does not cover all the known characteristics that a domain may take. For example, we did not study the effect of irrelevant data in the majority class. We assume that such a characteristic should be important since it may make undersampling more effective than oversampling, or even cost-modifying, on domains presenting a large variance in the distribution of the large class. Other characteristics should also be studied since they may reveal other strengths and weaknesses of the remedial methods surveyed in this study.

In addition, several other methods for dealing with class imbalance problems should be surveyed. Two approaches in particular are 1) over-sampling by the creation of new synthetic data points not present in the original data set but presenting similarities to the existing data points, and 2) learning from a single class rather than from two classes, trying to recognize examples of the class of interest rather than discriminate between examples of both classes.
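In the spirit of the first approach, and of SMOTE (Chawla et al., 00), one simple way such synthetic points could be created is sketched below purely as an illustration (our own code, not part of this study): each synthetic point is an interpolation between a small-class example and one of its randomly chosen small-class neighbours.

```python
import random

def synthetic_oversample(small, n_new, k=5, seed=0):
    """Create n_new synthetic one-dimensional examples for the small class.

    Each synthetic point lies on the segment between a randomly chosen
    small-class example and one of its k nearest small-class neighbours,
    in the spirit of SMOTE-style interpolation.
    """
    rng = random.Random(seed)
    small = sorted(small)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(small)
        # k nearest neighbours of x among the small-class values (excluding x itself).
        neighbours = sorted(small, key=lambda v: abs(v - x))[1:k + 1]
        nb = rng.choice(neighbours)
        synthetic.append(x + rng.random() * (nb - x))   # interpolate between x and nb
    return synthetic
```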
Finally, it would be interesting to combine, in an "intelligent" manner, the various methods previously proposed to deal with the class imbalance problem. Preliminary work on this subject was previously done by (Chawla et al., 01) and (Estabrooks and Japkowicz, 01), but much more remains to be done in this area.
Bibliography
Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, P. SMOTE: Synthetic Minority Over-sampling Technique. International Conference on Knowledge Based Computer Systems, 2000.

Domingos, P. MetaCost: A General Method for Making Classifiers Cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155-164, 1999.

Drummond, C. and Holte, R. Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria. Proceedings of the Seventeenth International Conference on Machine Learning, pp. 239-249, 2000.

Elkan, C. The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001.

Estabrooks, A. A Combination Scheme for Inductive Learning from Imbalanced Data Sets. MCS Thesis, Faculty of Computer Science, Dalhousie University, 2000.

Estabrooks, A. and Japkowicz, N. A Mixture-of-Experts Framework for Concept-Learning from Imbalanced Data Sets. Proceedings of the 2001 Intelligent Data Analysis Conference, 2001.

Fawcett, T. E. and Provost, F. Adaptive Fraud Detection. Data Mining and Knowledge Discovery, 3(1):291-316, 1997.

Holte, R. C., Acker, L. E., and Porter, B. W. Concept Learning and the Problem of Small Disjuncts. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 813-818, 1989.

Japkowicz, N., Myers, C., and Gluck, M. A Novelty Detection Approach to Classification. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 518-523, 1995.

Japkowicz, N. Learning from Imbalanced Data Sets: A Comparison of Various Solutions. Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets, 2000.

Japkowicz, N. Concept-Learning in the Presence of Between-Class and Within-Class Imbalances. Advances in Artificial Intelligence: Proceedings of the 14th Conference of the Canadian Society for Computational Studies of Intelligence, pp. 67-77, 2001.

Lawrence, S., Burns, I., Back, A. D., Tsoi, A. C., and Giles, C. L. Neural Network Classification and Unequal Prior Class Probabilities. In G. Orr, R.-R. Muller, and R. Caruana, editors, Tricks of the Trade, Lecture Notes in Computer Science State-of-the-Art Surveys, pp. 299-314. Springer Verlag, 1998.

Kubat, M. and Matwin, S. Addressing the Curse of Imbalanced Data Sets: One-Sided Sampling. Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179-186, 1997.

Kubat, M., Holte, R., and Matwin, S. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30:195-215, 1998.

Murphy, P. M. and Aha, D. W. UCI Repository of Machine Learning Databases. University of California at Irvine, Department of Information and Computer Science, 1994.

Lewis, D. and Catlett, J. Heterogeneous Uncertainty Sampling for Supervised Learning. Proceedings of the Eleventh International Conference on Machine Learning, pp. 148-156, 1994.

Ling, C. X. and Li, C. Data Mining for Direct Marketing: Problems and Solutions. International Conference on Knowledge Discovery & Data Mining, 1998.

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., and Brunk, C. Reducing Misclassification Costs. Proceedings of the Eleventh International Conference on Machine Learning, pp. 217-225, 1994.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland (Eds), MIT Press, Cambridge, MA, pp. 318-364, 1986.

Schaffer, C. Overfitting Avoidance as Bias. Machine Learning, 10:153-178, 1993.

Swets, J., Dawes, R., and Monahan, J. Better Decisions through Science. Scientific American, October 2000, pp. 82-97.