Applying Support Vector Machines to Imbalanced Datasets

Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA);
         Nathalie Japkowicz (University of Ottawa, Canada)
Published: European Conference on Machine Learning (ECML), 2004
Presenter: Rehan Akbani
Home Page: http://www.cs.utsa.edu/~rakbani/
Presentation Outline

- Motivation and Problem Definition
- Key Issues
- Support Vector Machines Background
- Problem in Detail
- Traditional Approaches to Solve the Problem
- Our Approach
- Results and Conclusions
- Future Work and Suggested Improvements
Motivation

- Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa).
- Naturally occurring imbalanced datasets:
  - Gene profiling
  - Medical diagnosis
  - Credit card fraud detection
- Ratios of negative to positive instances of 100 to 1 are not uncommon.
Key Issues

- Traditional algorithms such as SVMs, decision trees, and neural networks perform poorly with imbalanced data.
- Accuracy is not a good metric for measuring performance.
- We need to improve traditional algorithms so that they can handle imbalanced data.
- We need to define other metrics to measure performance.
Support Vector Machines: Background

Find the maximum-margin boundary that separates the green and red instances (figure).
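As a reminder (not from the original slide), the maximum-margin boundary is the solution of the standard hard-margin SVM problem:

$$ \min_{w,\,b}\ \tfrac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1,\dots,n $$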
Support Vector Machines: Support Vectors

Circled instances are the support vectors (figure).
Support Vector Machines: Kernels

Kernels allow non-linear separation of instances, e.g. the Gaussian kernel.
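For concreteness, the Gaussian (RBF) kernel mentioned here is usually written as:

$$ K(x_i, x_j) = \exp\!\left(-\gamma\,\|x_i - x_j\|^{2}\right) $$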
Effects of Imbalance on SVM

1. Positive (minority) instances lie further away from the "ideal" boundary.
Effects of Imbalance on SVM

2. The support vector ratio is imbalanced (support vectors shown in red in the figure).
Effects of Imbalance on SVM

3. Weakness of soft margins: minimize the primal Lagrangian:

$$ L_p = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \;-\; \sum_{i=1}^{n}\alpha_i\left[\,y_i\,(w \cdot x_i + b) - 1 + \xi_i\,\right] \;-\; \sum_{i=1}^{n} r_i\,\xi_i $$

Compromise between minimization of total error and maximization of margin.
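For context (added here, not on the original slide), the Lagrangian above comes from the standard soft-margin problem, where the slack variables ξ_i and a single error cost C are shared by both classes:

$$ \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 $$

Because the same C penalizes errors on both classes, when negatives vastly outnumber positives the cheapest solution often tolerates errors on the few positive instances.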
Effects of Imbalance on SVM
The margin is maximized at the cost of a small total error (figure).
Traditional Approaches

- Oversample the minority class or undersample the majority class.
- The sample distribution is no longer random; it no longer approximates the target distribution.
  - Defense: the sample was biased to begin with.
- With undersampling, we are discarding instances that may contain valuable information.
Problem with Undersampling

[Figure: learned boundary before vs. after undersampling] After undersampling, the learned plane estimates the distance of the ideal plane better, but the orientation of the learned plane is no longer as accurate.
Our Approach – SMOTE with Different Costs (SDC)

- Do not undersample the majority class, in order to retain all the information.
- Use the Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al., 2002).
- Use Different Error Costs (DEC) to push the boundary away from the positive instances (Veropoulos et al., 1999), as sketched below.
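A sketch of the DEC idea from Veropoulos et al. (1999): the single cost C of the soft-margin objective is split into separate costs for the two classes, with the positive (minority) cost set higher:

$$ \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} \;+\; C^{+}\!\!\sum_{i:\,y_i=+1}\!\!\xi_i \;+\; C^{-}\!\!\sum_{i:\,y_i=-1}\!\!\xi_i, \qquad C^{+} > C^{-} $$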
Effect of DEC

[Figure: decision boundary before DEC vs. after DEC]

Effect of SMOTE and DEC (SDC)

[Figure: decision boundary after DEC alone vs. after SMOTE and DEC]
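A minimal sketch of the SDC recipe, using imbalanced-learn and scikit-learn as stand-ins for the authors' implementation; the function name train_sdc, the 0/1 labels, and the specific cost and oversampling ratios are illustrative, not taken from the paper:

```python
# Illustrative sketch of SMOTE with Different Costs (SDC), not the authors' code.
# Assumes binary labels: 1 = positive (minority) class, 0 = negative (majority) class.
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

def train_sdc(X, y, cost_ratio=10.0, smote_ratio=0.5):
    """Oversample the positive class with SMOTE, then train an SVM whose
    error cost on the positive class is cost_ratio times higher (DEC)."""
    # Step 1 (SMOTE): create synthetic positives by interpolating between
    # existing positive instances and their nearest positive neighbours.
    X_res, y_res = SMOTE(sampling_strategy=smote_ratio).fit_resample(X, y)

    # Step 2 (DEC): a larger error cost on the positive class pushes the
    # learned boundary away from the positive instances.
    clf = SVC(kernel="rbf", gamma=1.0, class_weight={1: cost_ratio, 0: 1.0})
    clf.fit(X_res, y_res)
    return clf
```

Retaining every majority instance while adding synthetic positives mirrors the motivation above: no information is discarded, and the higher positive cost counteracts the remaining imbalance.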
Experiments

- Used 10 different UCI datasets.
- Compared with four other algorithms:
  - Regular SVM
  - Undersampling (US)
  - Different Error Costs (DEC) alone
  - SMOTE alone
- Used linear, polynomial (degree 2), and Radial Basis Function (RBF) (γ = 1) kernels; an illustrative configuration follows below.
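The three kernel settings, expressed with scikit-learn's SVC purely as an illustration (the SVM package actually used in the experiments is not specified on the slides):

```python
from sklearn.svm import SVC

# The three kernels reported in the experiments: linear, polynomial of
# degree 2, and RBF with gamma = 1.
kernels = {
    "linear": SVC(kernel="linear"),
    "poly2":  SVC(kernel="poly", degree=2),
    "rbf":    SVC(kernel="rbf", gamma=1.0),
}
```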
Metric Used – g-means

- Used the g-means metric (Kubat et al., 1997). Higher g-means means better performance:

$$ g\text{-}means = \sqrt{Sensitivity \times Specificity} $$

  - Sensitivity = TP / (TP + FN)
  - Specificity = TN / (TN + FP)
- Used by researchers such as Kubat, Matwin, Holte, Wu, and Chang (1997 – 2003) for imbalanced datasets.
- Can be computed easily and results can be displayed compactly. Suitable for use with several datasets and SVM, where time and space are limited.
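A small helper for computing g-means, assuming binary 0/1 labels with 1 as the positive (minority) class; the function name is illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_means(y_true, y_pred):
    # With labels=[0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return np.sqrt(sensitivity * specificity)
```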
Datasets Used - UCI

Dataset       | Imbalance | Positive Insts. | Negative Insts. | % Oversampled
--------------|-----------|-----------------|-----------------|--------------
Abalone19     | 1 : 130   | 32              | 4145            | 1000
Anneal5       | 1 : 12    | 67              | 831             | 400
Car3          | 1 : 24    | 69              | 1659            | 400
Glass7        | 1 : 6     | 29              | 185             | 200
Hepatitis1    | 1 : 4     | 32              | 123             | 100
Hypothyroid3  | 1 : 39    | 95              | 3677            | 400
Letter26      | 1 : 26    | 734             | 19266           | 200
Segment1      | 1 : 6     | 330             | 1980            | 100
Sick2         | 1 : 15    | 231             | 3541            | 100
Soybean12     | 1 : 15    | 44              | 639             | 100
Results

Dataset     | SVM       | US        | SMOTE     | DEC       | SDC
------------|-----------|-----------|-----------|-----------|----------
abalone     | 0         | 0.6436394 | 0         | 0.8064562 | 0.7449049
anneal      | 1         | 1         | 1         | 1         | 1
car         | 0         | 0.960104  | 0.9846381 | 0.3162278 | 0.9846381
glass       | 0.8660254 | 0.880108  | 0.877058  | 0.9199519 | 0.9405399
hepatitis   | 0.5959695 | 0.7470874 | 0.742021  | 0.6942835 | 0.7682954
hypothyroid | 0.1767767 | 0.8938961 | 0.8025625 | 0.9384492 | 0.9581446
letter      | 0.8182931 | 0.9555176 | 0.947737  | 0.9871834 | 0.9816909
segment     | 0.9950372 | 0.9917748 | 0.9765287 | 0.9950372 | 0.9783467
sick        | 0         | 0.7641141 | 0.4071283 | 0.8627879 | 0.8695146
soybean     | 0.9258201 | 0.9672867 | 1         | 1         | 1
Mean        | 0.537792  | 0.880353  | 0.773767  | 0.852038  | 0.922608

g-means metric for each algorithm and dataset
Results

[Bar chart] g-means graphs for each algorithm (SVM, US, SMOTE, DEC, SDC) and dataset
Conclusions

- Our algorithm (SDC) outperforms the other four algorithms; undersampling is the runner-up.
- SDC performs better than undersampling on 9 out of 10 datasets.
- It always performs better than or equal to SMOTE.
- It performs better than or equal to DEC on 7 out of 10 datasets.
- It has limitations similar to those of SMOTE:
  - It assumes the space between two neighboring positive instances is positive.
  - It assumes the neighborhood of a positive instance is positive.
Future Work and Suggested Improvements

- Design a better oversampling technique that does not assume a convex positive space.
- Evaluate the algorithm on biological datasets with extremely high degrees of imbalance (over 10,000 to 1).
- Find out whether the technique can be extended to other ML algorithms that have lower execution time than SVM.
- Analyze the robustness of the algorithm against noisy minority instances.
Questions?