Using Support Vectors Machine for Classification of Remotely Sensed Images

M. Bundzel, P. Sinčák, N. Kopčo

Computational Intelligence Group, Laboratory of AI, Department of Cybernetics and AI, Faculty of Electrical Engineering and Informatics, TU Košice, Slovakia

Abstract: The paper presents a comparison study of the support vector machine (SVM) classification approach and ARTMAP neural classifiers. SVM provides very interesting mathematical methods based on a virtual transformation of the input space into a multidimensional space. The highly nonlinear discrimination boundary is approximated by transforming the task into dichotomous classification, with the aim of achieving the best classification results. SVM was used with the RBF kernel, and experiments were done on benchmark data as well as on real-world satellite images over Slovakia. Comparisons with Fuzzy ARTMAP and Gaussian ARTMAP on these data were carried out. An adaptive kernel function based on a neural network is proposed for future research in this area. Classification is evaluated using contingency tables for multiclass classification problems. The aim was to develop a classification tool with the highest accuracy on the tested images.

Keywords: Support Vector Machines, VC dimension, RBF kernel function, Fuzzy ARTMAP, Gaussian ARTMAP, accuracy assessment, contingency tables

1 Introduction

Support Vector Machines (SVMs) represent a powerful tool in the field of pattern recognition and regression. Although introduced by Vapnik in 1979, SVMs have received increasing attention only in the last few years. For further research it is important to evaluate the power of SVMs with different kernels and to compare it with existing methods. In this work, Gaussian ARTMAP was chosen as a competitor of SVM with the RBF kernel. The motivation of the project is to accomplish a comparison study between SVM-type classifiers and ARTMAP-family classifiers on test data and on real-world data.
2 Support Vector Machine as a classifier

2.1 Basic description of SVM principles

An extended version of the following description can be found in [1]. Let the training data be of the form

$$\{x_i, y_i\}, \quad i = 1, \dots, l, \quad y_i \in \{-1, 1\}, \quad x_i \in R^d .$$

Let us assume that there is a hyperplane that separates the positive examples from the negative examples (a separating hyperplane). The points lying on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane and $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin ($\|w\|$ is the Euclidean norm of $w$). Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of the separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the algorithm looks for the separating hyperplane with the largest margin. This situation is illustrated in the following figure.

Figure 1 Linear separation with SVM, support vectors are circled. (Axes: x1, x2; the plot shows the hyperplanes H1 and H2, the normal vector w, and the margin.)

The constraints for this optimization problem are as follows:

$$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1 \qquad (1)$$

and also

$$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1 \qquad (2)$$

or, combined into one set of inequalities:

$$y_i (x_i \cdot w + b) - 1 \ge 0 \quad \forall i \qquad (3)$$

If the examples are linearly separable, it is always possible to find $w$ and $b$ such that the inequalities (1) and (2) hold. The hyperplanes $H_1: x \cdot w + b = 1$ and $H_2: x \cdot w + b = -1$ bound the margin, whose width is $2 / \|w\|$. $H_1$ and $H_2$ are parallel and no training examples fall between them. The optimal hyperplane is determined by the $w$ and $b$ for which $2 / \|w\|$ is maximal (or, equivalently, $\|w\|^2$ is minimal) subject to constraints (1) and (2). Figure 1 illustrates the typical two-dimensional case of an optimal hyperplane. Points lying on one of the hyperplanes $H_1$, $H_2$ are called support vectors. All other points could be removed from the training set without changing the solution.

The optimization problem is then switched to the Lagrangian formulation. The Kuhn-Tucker (or Karush-Kuhn-Tucker) theorem for so-called convex optimization is used for this purpose.
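Before the Lagrangian formulation is developed further, the primal constraints (1) to (3), the margin width and the notion of a support vector can be checked numerically. The following is a minimal sketch in plain Python; the toy points, the hyperplane `w`, `b` and all function names are hypothetical and chosen only for illustration, and no optimization is performed.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def functional_margins(points, labels, w, b):
    """Return y_i * (x_i . w + b) for every training example."""
    return [y * (dot(x, w) + b) for x, y in zip(points, labels)]

def is_separating(points, labels, w, b, tol=1e-9):
    """Check constraint (3): y_i (x_i . w + b) - 1 >= 0 for all i."""
    margins = functional_margins(points, labels, w, b)
    return all(m - 1.0 >= -tol for m in margins)

def margin_width(w):
    """Distance between H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

def support_vector_indices(points, labels, w, b, tol=1e-9):
    """Indices of points lying on H1 or H2 (constraint (3) is tight)."""
    margins = functional_margins(points, labels, w, b)
    return [i for i, m in enumerate(margins) if abs(m - 1.0) <= tol]

# Hypothetical toy data: two positive and two negative examples in R^2.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
# Hypothetical hyperplane, given by hand rather than found by training.
w = (0.5, 0.5)
b = -1.0

print(is_separating(points, labels, w, b))
print(support_vector_indices(points, labels, w, b))
print(margin_width(w))
```

In this toy case only the points with index 0 and 2 lie on H1 and H2; removing the other two points would leave the hyperplane unchanged.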
The most important reason for the Lagrangian reformulation is that the training data then appear only in the form of dot products. This is a crucial feature that allows the procedure to be generalized to the nonlinear case. The reformulated problem is:

Maximize:

$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (4)$$

subject to:

$$0 \le \alpha_i \le C \qquad (5)$$

$$\sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (6)$$

where $\alpha_i$ are the so-called Lagrange multipliers and $C$ is a user-set constant that bounds the multipliers; a larger $C$ penalizes training errors more heavily.

2.2 SVM Kernel functions

The above methods can be generalized to the case where the decision function is not a linear function of the data. In some cases the training data cannot be separated by a hyperplane, and relaxing the constraints would lead to poor classification. A possible solution is to map the data to some other (multidimensional, even infinite-dimensional) Euclidean space $H$, where it is possible to separate the data with a hyperplane. Let us assume the mapping

$$\Phi: R^d \to H \qquad (7)$$

where $R^d$ is the space of the training data and $H$ is the transformed (Hilbert) space. Note that the data appear in the training problem only in the form of dot products, Eqs. (4), (5), (6). Let us introduce a "kernel function" $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. It is then possible to replace the usual dot products by $K$ everywhere in the algorithm without knowing what $\Phi$ is. One example is:

$$K_{ex}(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2} \qquad (8)$$

In this particular example, $H$ is infinite-dimensional, so it would not be possible to work with $\Phi$ explicitly. If $x_i \cdot x_j$ is replaced with $K_{ex}(x_i, x_j)$ everywhere in the algorithm, the SVM lives in an infinite-dimensional space. The training time is roughly the same as for the un-mapped data. All the considerations of the previous section hold, since linear separation is still performed, only in a different space. Using $\Phi$ explicitly is also avoided in the test phase.
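The kernel substitution can be sketched in a few lines of plain Python: the RBF kernel of Eq. (8), the dual objective of Eq. (4) with the dot product replaced by the kernel, the constraints (5) and (6), and the standard kernel decision value $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$, which never evaluates $\Phi$. The function names, the toy points and the hand-picked multipliers are assumptions made for illustration; the multipliers are feasible but not the result of an actual optimization.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Eq. (8): exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def dual_objective(alphas, labels, points, kernel):
    """Eq. (4) with the dot product x_i . x_j replaced by kernel(x_i, x_j)."""
    quad = 0.0
    for a_i, y_i, x_i in zip(alphas, labels, points):
        for a_j, y_j, x_j in zip(alphas, labels, points):
            quad += a_i * a_j * y_i * y_j * kernel(x_i, x_j)
    return sum(alphas) - 0.5 * quad

def is_feasible(alphas, labels, C, tol=1e-9):
    """Constraints (5) and (6): 0 <= alpha_i <= C and sum alpha_i y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alphas)
    balanced = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
    return in_box and balanced

def decision_value(x, alphas, labels, points, b, kernel):
    """Kernel decision value; its sign gives the predicted class."""
    return sum(a * y * kernel(p, x)
               for a, y, p in zip(alphas, labels, points)) + b

# Hypothetical toy problem with hand-picked (not optimized) multipliers.
points = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, 1]
alphas = [0.5, 0.5]
C = 1.0
b = 0.0

print(is_feasible(alphas, labels, C))
print(dual_objective(alphas, labels, points, rbf_kernel))
print(decision_value((1.0, 1.0), alphas, labels, points, b, rbf_kernel) > 0)
```

Swapping `rbf_kernel` for an ordinary dot product recovers the linear case, which is the point of expressing the whole procedure through `kernel` alone.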
3 ARTMAP family neural networks

ARTMAP neural networks belong to the class of neural networks based on Adaptive Resonance Theory (ART), a theory of cognitive information processing in the human brain. Based on this theory, a whole family of neural network algorithms was developed. These neural networks were shown to give very good performance in applications involving clustering, classification, and pattern recognition. When compared to statistical and other neural-network-based clustering and classification algorithms, these networks usually obtain very good classification accuracy, while securing proven stability and a high level of compression in the system.

From the point of view of this study, the currently available ARTMAP classification systems can be divided into two groups. The first group consists of systems based on (or modifying) the fuzzy ARTMAP algorithm (e.g., ARTMAP-IC, ART-EMAP, etc.). All these systems share the property that they prefer data clusters distributed into hyper-rectangles in feature space. In these systems the basic properties of the original ARTMAP design (stability, proven convergence, fast on-line learning) are preserved, but they also have well-known disadvantages, e.g., noise sensitivity and a tendency to category proliferation.

The other group is based on the Gaussian ARTMAP neural network. In this group of networks, which preferably identify Gaussian-shaped clusters, the stability and fast on-line learning properties of the fuzzy ARTMAP networks are traded for an emphasis on the ability of the system to generalize and for decreased sensitivity to noise in the input data.

Structurally, every ARTMAP network (fuzzy ARTMAP or Gaussian ARTMAP) can be divided into two parts. The first part, represented by an ART module, dynamically generates units, each identifying a single data cluster in feature space. This part can be used autonomously for cluster analysis of a given data set.
The second part serves to identify each of the clusters found in the data with one of the classes defined on the data set.

A detailed description of fuzzy ARTMAP (FA), the first of the algorithms analyzed in this study, can be found in many previously published studies. From the point of view of this study, the most important property of this system is that the subsystem identifying clusters in feature space preferably identifies clusters in which patterns are distributed as hyper-rectangles, as illustrated in the following figure.

Figure 2 Distribution of discrimination rectangles defined by fuzzy ARTMAP in the feature space.

4 Experimental results

Experiments were done on benchmark and real-world data. Thorsten Joachims' implementation of SVM was used ([2]). A simple extension to the algorithm was made in order to achieve multiclass classification. In all cases the Radial Basis Function (RBF) kernel was used. Classification accuracy was assessed by a contingency table approach.

Two benchmark datasets were prepared for classification purposes: "circle in the square" and "double spiral", both used for dichotomous classification. The results of the SVM approach and of fuzzy ARTMAP are presented in Table 1 and Table 2, respectively. In each table, columns correspond to the actual class and rows to the predicted class.

"Circle in the square"

| Predicted class | Actual class A | Actual class B |
|---|---|---|
| A' | 99.54% | 0.68% |
| B' | 0.46% | 99.32% |

"Double spiral"

| Predicted class | Actual class A | Actual class B |
|---|---|---|
| A' | 93.25% | 57.24% |
| B' | 6.75% | 42.76% |

Table 1 SVM results on benchmark datasets

"Circle in the square"

| Predicted class | Actual class A | Actual class B |
|---|---|---|
| A' | 98.34% | 2.80% |
| B' | 1.66% | 97.20% |

"Double spiral"

| Predicted class | Actual class A | Actual class B |
|---|---|---|
| A' | 87.59% | 9.26% |
| B' | 12.41% | 90.74% |

Table 2 Fuzzy ARTMAP results on benchmark datasets

As shown in Table 1, the otherwise properly working SVM failed to classify class B of the "double spiral" dataset: 57.24% of its samples were assigned to class A. The reason for this behavior remained unclear. Increasing the error intolerance constant led to an unaffordably long training time, and adjusting the RBF coefficient did not help either.
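The contingency tables above appear to be normalized per actual class, since the two entries for each actual class sum to 100%. Under that assumption, such a table can be computed from lists of actual and predicted labels as in the following minimal Python sketch; the function name and the toy labels are hypothetical and are not the benchmark data.

```python
from collections import Counter

def contingency_table(actual, predicted, classes):
    """Percentage table: table[p][a] is the share of samples of actual
    class a that were predicted as class p (columns sum to 100)."""
    pair_counts = Counter(zip(predicted, actual))
    actual_totals = Counter(actual)
    table = {}
    for p in classes:
        row = {}
        for a in classes:
            total = actual_totals[a]
            # Guard against classes that do not occur in the test set.
            row[a] = 100.0 * pair_counts[(p, a)] / total if total else 0.0
        table[p] = row
    return table

# Hypothetical toy labels, only to exercise the function.
actual = ["A", "A", "A", "A", "B", "B", "B", "B"]
predicted = ["A", "A", "A", "B", "A", "A", "B", "B"]

table = contingency_table(actual, predicted, ["A", "B"])
for p in ["A", "B"]:
    print(p + "'", ["%.2f%%" % table[p][a] for a in ["A", "B"]])
```

The diagonal entries `table[c][c]` are the per-class accuracies; the off-diagonal entries show into which class the misclassified samples were placed.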
4.1 Experiments on real-world data

In addition to the benchmark data, the behavior of the methods was observed on multispectral image data, with the aim of obtaining the best classification accuracy on the test data subset. The Košice data consist of a training set of 3164 points and a test set of 3167 points in the feature space. Each point has 7 real-valued coordinates normalized into the interval (0,1) and 7 binary output values. The class of a sample is determined by the output whose value is one; the other six output values are zero. The data represent 7 spectral attributes sensed by the Landsat satellite. The representative set was determined by a geographer and supported by a ground verification procedure. The main goal was land-use identification with the most accurate classification procedure available. The image was taken over eastern Slovakia, in particular over the region of the city of Košice. Seven classes of interest were selected for the classification procedure, as can be seen in Figure 3. The results of the classification procedures are presented in the form of contingency tables (Tables 3 and 4). SVM with the RBF kernel function was used.

Figure 3 Original image.
Highlighted areas were classified by an expert (A – urban area, B – barren fields, C – bushes, D – agricultural fields, E – meadows, F – forests, G – water).

| Predicted class | Actual A | Actual B | Actual C | Actual D | Actual E | Actual F | Actual G |
|---|---|---|---|---|---|---|---|
| A' | 93.51 | 0.84 | 0.00 | 0.00 | 3.32 | 0.00 | 1.34 |
| B' | 0.61 | 88.30 | 0.00 | 3.56 | 12.76 | 0.00 | 1.34 |
| C' | 0.00 | 0.00 | 100.00 | 0.00 | 2.27 | 0.00 | 0.00 |
| D' | 0.00 | 8.45 | 0.00 | 96.33 | 0.52 | 0.00 | 0.00 |
| E' | 3.25 | 2.47 | 0.00 | 0.11 | 79.55 | 0.16 | 5.80 |
| F' | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 98.97 | 2.68 |
| G' | 2.64 | 0.00 | 0.00 | 0.00 | 1.57 | 0.87 | 88.84 |

Table 3 Confusion matrix for the fuzzy ARTMAP neural network with voting from 5 networks (values in %). The overall weighted PCC is 93.95 %.

| Predicted class | Actual A | Actual B | Actual C | Actual D | Actual E | Actual F | Actual G |
|---|---|---|---|---|---|---|---|
| A' | 96.15 | 0.00 | 0.00 | 0.00 | 2.76 | 0.00 | 1.41 |
| B' | 0.00 | 87.68 | 0.00 | 3.01 | 6.08 | 0.00 | 0.00 |
| C' | 0.00 | 1.64 | 100.0 | 0.11 | 1.10 | 0.00 | 0.00 |
| D' | 0.00 | 7.60 | 0.00 | 96.88 | 0.00 | 0.09 | 1.41 |
| E' | 0.64 | 2.05 | 0.00 | 0.00 | 83.98 | 0.34 | 8.45 |
| F' | 3.21 | 1.03 | 0.00 | 0.00 | 6.08 | 99.49 | 1.41 |
| G' | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 87.32 |

Table 4 Confusion matrix for SVM (values in %). The overall weighted PCC is 95.64 %.
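The text does not spell out how the overall weighted PCC is obtained from the per-class percentages. One plausible reading, stated here purely as an assumption, is that each class's diagonal entry is weighted by the number of test samples in that class. The sketch below illustrates that reading in Python; the function name, the diagonal values and the class counts are hypothetical toy numbers, not the Košice data, so the result is not meant to reproduce the reported figures.

```python
def weighted_pcc(diagonal_percent, class_counts):
    """Average of per-class accuracies (in %), weighted by class size.

    Assumption: 'weighted' means weighting by the number of test
    samples per actual class; the source does not confirm this.
    """
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one sample")
    weighted_sum = sum(diagonal_percent[c] * class_counts[c]
                       for c in class_counts)
    return weighted_sum / total

# Hypothetical toy values: per-class accuracy in % and test-set sizes.
diagonal_percent = {"A": 90.0, "B": 50.0}
class_counts = {"A": 300, "B": 100}

print(weighted_pcc(diagonal_percent, class_counts))
```

With equal class sizes this reduces to the plain mean of the diagonal entries; with unequal sizes the larger classes dominate, which is why the overall figure can exceed several of the individual per-class accuracies.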