Applying Support Vector Machines to Imbalanced Datasets

Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA); Nathalie Japkowicz (University of Ottawa, Canada)
Published: European Conference on Machine Learning (ECML), 2004
Presenter: Rehan Akbani
Home Page: http://www.cs.utsa.edu/~rakbani/

Presentation Outline
Motivation and Problem Definition
Key Issues
Support Vector Machines Background
Problem in Detail
Traditional Approaches to Solve the Problem
Our Approach
Results and Conclusions
Future Work and Suggested Improvements

Motivation
Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa).
Naturally occurring imbalanced datasets: gene profiling, medical diagnosis, credit card fraud detection.
Ratios of negative to positive instances of 100 to 1 are not uncommon.

Key Issues
Traditional algorithms such as SVMs, decision trees, and neural networks perform poorly on imbalanced data.
Accuracy is not a good metric for measuring performance.
We need to improve traditional algorithms so that they can handle imbalanced data.
We need to define other metrics to measure performance.

Support Vector Machines - Background
Find the maximum-margin boundary that separates the green and red instances.

Support Vector Machines - Support Vectors
The circled instances are the support vectors.

Support Vector Machines - Kernels
Kernels allow non-linear separation of instances, e.g. the Gaussian kernel.

Effects of Imbalance on SVM
1. Positive (minority) instances lie further away from the "ideal" boundary.
2. The support vector ratio is imbalanced. (In the figure, support vectors are shown in red.)
3. Weakness of soft margins. The SVM minimizes the primal Lagrangian

   L_P = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i,

   a compromise between minimizing the total error and maximizing the margin. With imbalanced data, the margin is maximized at the cost of a small total error.

Traditional Approaches
Oversample the minority class or undersample the majority class.
The sample is no longer random: its distribution no longer approximates the target distribution.
  Defense: the sample was biased to begin with.
With undersampling, we are discarding instances that may contain valuable information.

Problem with Undersampling
(Figures: before vs. after undersampling.) After undersampling, the learned plane estimates the distance to the ideal plane better, but its orientation is no longer as accurate.

Our Approach - SMOTE with Different Costs (SDC)
Do not undersample the majority class, so that all of the information is retained.
Use the Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al., 2002).
Use Different Error Costs (DEC) to push the boundary away from the positive instances (Veropoulos et al., 1999).
(A code sketch of this pipeline follows the g-means slide below.)

Effect of DEC
(Figures: before DEC vs. after DEC.)

Effect of SMOTE and DEC (SDC)
(Figures: after DEC alone vs. after SMOTE and DEC.)

Experiments
Used 10 different UCI datasets.
Compared with four other algorithms: regular SVM, undersampling (US), Different Error Costs (DEC) alone, and SMOTE alone.
Used linear, polynomial (degree 2), and Radial Basis Function (RBF) (γ = 1) kernels.

Metric Used - g-means
Used the g-means metric (Kubat et al., 1997); a higher g-means indicates better performance:

   g\text{-means} = \sqrt{\text{Sensitivity} \times \text{Specificity}}, where Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP).

Used by researchers such as Kubat, Matwin, Holte, Wu, and Chang (1997-2003) for imbalanced datasets.
It can be computed easily and results can be displayed compactly, which suits a study over several datasets and SVM kernels where time and space are limited.
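The following is a minimal, illustrative sketch of the SDC pipeline, not the authors' original code: it oversamples the positive (minority) class with SMOTE-style interpolation, trains a soft-margin SVM with a larger error cost on the positive class (DEC), and scores predictions with g-means. It assumes Python with NumPy and scikit-learn, which postdate this 2004 work; the oversampling rate, cost ratio, kernel, and the 0/1 labelling of majority/minority classes are illustrative choices, not values taken from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix


def smote_oversample(X_min, n_new, k=5, rng=None):
    """SMOTE-style oversampling (Chawla et al., 2002): each synthetic point is
    a random convex combination of a minority instance and one of its k
    nearest minority-class neighbors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    k = min(k, len(X_min) - 1)                     # guard for very small classes
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                  # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))               # pick a minority instance
        j = idx[i, rng.integers(1, k + 1)]         # pick one of its k neighbors
        lam = rng.random()                         # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)


def g_means(y_true, y_pred):
    """g-means = sqrt(Sensitivity * Specificity), with the positive class = 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return np.sqrt(sens * spec)


def train_sdc(X, y, oversample_pct=200, cost_ratio=10.0, kernel="linear"):
    """SDC sketch: SMOTE the positive class (label 1), then fit an SVM whose
    error cost on that class is cost_ratio times larger (DEC)."""
    X_min = X[y == 1]
    n_new = int(len(X_min) * oversample_pct / 100)  # e.g. 200% oversampling
    X_syn = smote_oversample(X_min, n_new)
    X_aug = np.vstack([X, X_syn])
    y_aug = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
    # class_weight scales the per-class cost: C+ = cost_ratio * C, C- = C
    clf = SVC(kernel=kernel, C=1.0, class_weight={0: 1.0, 1: cost_ratio})
    return clf.fit(X_aug, y_aug)
```

On a held-out split, `g_means(y_test, clf.predict(X_test))` gives the metric reported in the results below; the per-dataset oversampling rates listed in the Datasets slide (100-1000%) indicate that the oversampling amount was chosen per dataset, and the cost ratio here is likewise only a placeholder.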
Datasets Used - UCI

Dataset        Imbalance   Positive Insts.   Negative Insts.   % Oversampled
Abalone19      1 : 130            32              4145              1000
Anneal5        1 : 12             67               831               400
Car3           1 : 24             69              1659               400
Glass7         1 : 6              29               185               200
Hepatitis1     1 : 4              32               123               100
Hypothyroid3   1 : 39             95              3677               400
Letter26       1 : 26            734             19266               200
Segment1       1 : 6             330              1980               100
Sick2          1 : 15            231              3541               100
Soybean12      1 : 15             44               639               100

Results - g-means metric for each algorithm and dataset

Dataset        SVM          US           SMOTE        DEC          SDC
abalone        0            0.6436394    0            0.8064562    0.7449049
anneal         1            1            1            1            1
car            0            0.960104     0.9846381    0.3162278    0.9846381
glass          0.8660254    0.880108     0.877058     0.9199519    0.9405399
hepatitis      0.5959695    0.7470874    0.742021     0.6942835    0.7682954
hypothyroid    0.1767767    0.8938961    0.8025625    0.9384492    0.9581446
letter         0.8182931    0.9555176    0.947737     0.9871834    0.9816909
segment        0.9950372    0.9917748    0.9765287    0.9950372    0.9783467
sick           0            0.7641141    0.4071283    0.8627879    0.8695146
soybean        0.9258201    0.9672867    1            1            1
Mean           0.537792     0.880353     0.773767     0.852038     0.922608

Results
(Figure: g-means graphs for each algorithm and dataset.)

Conclusions
Our algorithm (SDC) outperforms all four of the other algorithms.
Undersampling is the runner-up; SDC performs better than undersampling in 9 out of 10 datasets.
SDC always performs better than or equal to SMOTE, and better than or equal to DEC in 7 out of 10 datasets.
It has limitations similar to those of SMOTE (see the formulation sketch after the Questions slide):
  It assumes the space between two neighboring positive instances is positive.
  It assumes the neighborhood of a positive instance is positive.

Future Work and Suggested Improvements
Design a better oversampling technique that does not assume a convex positive space.
Evaluate the algorithm on biological datasets with extremely high degrees of imbalance (over 10,000 to 1).
Find out whether the technique can be extended to other ML algorithms with lower execution time than SVM.
Analyze the robustness of the algorithm against noisy minority instances.

Questions?
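Backup: DEC and SMOTE formulations
(A reference sketch of the two components cited above, following Veropoulos et al. (1999) and Chawla et al. (2002); this is an added note, not a slide from the original deck.)

Different Error Costs replace the single cost C in the soft-margin objective with per-class costs:

   \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C^{+} \sum_{i : y_i = +1} \xi_i + C^{-} \sum_{i : y_i = -1} \xi_i
   \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.

Choosing C^{+} > C^{-} penalizes errors on the positive (minority) class more heavily, which pushes the learned boundary away from the positive instances.

SMOTE generates each synthetic positive instance by interpolating between a positive instance x_i and one of its k nearest positive neighbors x_z:

   x_{\text{new}} = x_i + \delta \, (x_z - x_i), \qquad \delta \sim U(0, 1).

Every synthetic point therefore lies on the segment joining two positive instances, which is exactly the assumption, noted in the Conclusions, that the space between neighboring positive instances is itself positive.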