Visualizing some multi-class erosion data using kernel methods

Anna Bartkowiak¹ and Niki Evelpidou²

¹ Institute of Computer Science, Wroclaw University, aba@ii.uni.wroc.pl
² Remote Sensing Laboratory, Geology Department, University of Athens, evelpidou@geol.uoa.gr

Summary. Using a given data set (the Kefallinia erosion data) with only 3 dimensions and with fractal correlation dimension r_GP ≈ 1.60, we wanted to see what the kernel methods really provide. We used Gaussian kernels with various kernel widths σ. In particular we wanted to find out whether GDA (Generalized Discriminant Analysis), as proposed by Baudat and Anouar (2000), distinguishes the high, medium and low erosion classes better than classical Fisherian discriminant analysis does. The general result is that GDA yields discriminant variates permitting a better differentiation among the groups; however, the calculations are considerably longer.

Key words: Generalized discriminant analysis, Gaussian kernels, Effect of kernel width, Visualization of canonical variates

1 Introduction

The classical canonical discriminant analysis based on the Between and Within cross-product matrices [Lach75, DHS01] uses in fact only linear discriminant functions. A variety of other methods may be found in [HTB97, DHS01, AbaAsz00]. A generalization to non-linear discriminant analysis is provided by kernel methods [BAnou00, Yang04, RotSt00]. Our concern is with kernel methods, in particular with the method called GDA (Generalized Discriminant Analysis) proposed by Baudat and Anouar [BAnou00]. In the case of 3 groups it makes it possible to construct more than two generalized canonical variates, which may serve for a more detailed visual display and analysis of the considered data.

In Section 2 we describe briefly the data. Section 3 visualizes the data using classical LDA (Linear Discriminant Analysis). In Section 4 we present the corresponding results obtained by GDA using various widths of the applied Gaussian kernels. Section 5 contains a summary and a discussion of the obtained results.

2 The Erosion Data

The data were gathered by a team from the Remote Sensing Laboratory, University of Athens (RSL-UOA) on the Greek island of Kefallinia. The entire island was covered by a grid containing 3422 cells. The area covered by each cell of the grid was characterized by several variables. For our purpose, to illustrate the visualization concepts, we consider only 3 variables: drainage density, slope and vulnerability of the soil (rocks). The values of the variables were re-scaled (normalized) to the interval [0, 1]. The data contained 2 very severe atypical observations. They were removed from the analysis presented hereafter, so as not to confound the outlier effect with the kernel width effect. Thus, for our analysis, we obtained a data set containing N = 3420 data vectors, each vector characterized by d = 3 variables. An expert GIS system, installed at the RSL-UOA, was used for assigning each data vector to one of 3 erosion classes: 1. high, 2. medium, 3. low. Let c = 3 denote the number of classes. A 3D plot of the data is shown in Figure 1.

[Figure 1: 3D scatter plot of the erosion data in the coordinates (normalized drainage density, slope, vulnerability); classes: high, medium, low.]

Fig. 1. The Kefallinia erosion data containing N=3420 data points (with 2 outliers removed). The data set is subdivided into 3 classes of progressing erosion risk: low (bottom), medium (in the middle), high (top). Notice the overlap of the medium class with the low and high erosion classes.

Looking at the plot in Figure 1 one may state that, generally, the distribution of the data is far from normal and also far from ellipsoidal in shape. Some parts of the space show a great concentration of data points, while other parts are sparsely populated. The Euclidean dimension of the data is d = 3; however, the intrinsic dimension of the data, as calculated by the Grassberger–Procaccia index [GP83], called the fractal correlation dimension, is only ≈ 1.60. Therefore we feel justified in representing the data in a plane.
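The intrinsic dimension quoted above can be estimated from pairwise distances. The following is a minimal NumPy sketch of the Grassberger–Procaccia estimator; it is not the code actually used for our data, and the synthetic input and the radius grid are illustrative choices only.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Grassberger-Procaccia estimate: slope of log C(r) versus log r.

    X     : (N, d) array of data points (here d = 3, rescaled to [0, 1]).
    radii : 1-D array of radii over which the correlation integral is fitted.
    """
    dist = pdist(X)                    # all pairwise Euclidean distances
    n_pairs = dist.size
    # correlation integral C(r) = fraction of point pairs closer than r
    C = np.array([(dist < r).sum() / n_pairs for r in radii])
    mask = C > 0                       # keep only radii with at least one pair
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((2000, 3))          # stand-in for the (rescaled) erosion data
    radii = np.logspace(-2, -0.5, 20)  # scaling region chosen by inspection
    print("estimated correlation dimension:", correlation_dimension(X, radii))
```

The slope is read off over a scaling region where log C(r) is approximately linear in log r; for the real erosion data this kind of estimate gave the value ≈ 1.60 quoted above.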
For our analysis we have subdivided the data set into two parts of equal size n = 1710. The first part – called sample1 in the following – is used for learning; the second part – sample2 – for testing. In the next section we will find 2 canonical discriminant variates containing most of the discriminative power of the analyzed data set.

3 Classical Canonical Discriminant Analysis (LDA)

The canonical discriminant functions are derived from Fisher's criterion. The problem is to find the linear combination a of the variables which separates the considered classes as much as possible. The criterion of separateness proposed by Fisher is the ratio of the between-class to the within-class scatter (variance) (see [Lach75, DHS01]). This ratio is denoted here as λ. Large values of λ indicate a good separation between classes. To be meaningful, the ratio λ should be greater than 1.0; the greater the ratio, the better the discrimination. For given data we may obtain at most h = min{(c−1), d} linear combinations as solutions of the stated problem (c denotes the number of classes and d the number of variables). These h linear combinations are called canonical discriminant functions (CDFs). They are used for projecting the data points into the canonical discriminant space. The derived projections are called canonical discriminant variates (CDVs).

For our erosion data we got h = 2 CDFs, which yielded 2 CDVs. In Figure 2 we show the data points displayed in the coordinate system of the obtained CDVs. The left panel shows the display of the entire data set; the right panel the display obtained from the learning set sample1.

[Figure 2: two scatter plots in the coordinates (Canonical Variate no. 1, Canonical Variate no. 2); left panel: entire data set (λ = 20.46, d = 3); right panel: learning set sample1 (n = 1710, λ = 19.9, d = 3); classes: high, medium, low; regions marked LOW and HIGH.]

Fig. 2. Projections of the data using canonical discriminant functions. Left: entire data set, N=3420. Right: halved data set (sample1), n=1710. The first canonical variate has a large discriminative power: λ1 ≈ 20; the second canonical variate has no discriminative power: λ2 < 0.01.

The discriminative power of the derived canonical variates (CDVs) is indicated by the magnitude of the corresponding values of their λ statistics [Lach75, DHS01]. For our data the first CDV has a very large discriminative power (λ1 ≈ 20.0), while the second one has none (λ2 < 0.01). Nevertheless we use the second CDV as well, because it is helpful in perceiving the differentiation between the 3 classes.

As may be seen in Figure 2, the separation of the groups is not ideal: a noticeable number of low and high erosion data points overlaps with the medium erosion class. The direction of the 1st CDV reflects the increase of the erosion risk: from very small (left) to very high (right).
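The construction behind Figure 2 is the textbook one: compute the between- and within-class scatter matrices and solve the associated generalized eigenproblem. The following NumPy sketch is our own illustration of that standard recipe, not the code used to produce the figures; all names are our choices.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_discriminants(X, y):
    """Fisher canonical discriminant functions (classical LDA).

    X : (N, d) data matrix, y : (N,) integer class labels.
    Returns the lambda ratios (between/within scatter of each variate)
    and the matrix A whose columns are the canonical discriminant functions.
    """
    classes = np.unique(y)
    d = X.shape[1]
    grand_mean = X.mean(axis=0)
    W = np.zeros((d, d))                  # within-class scatter
    B = np.zeros((d, d))                  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += len(Xc) * np.outer(mc - grand_mean, mc - grand_mean)
    # generalized eigenproblem  B a = lambda W a  (eigenvalues in ascending order)
    lam, A = eigh(B, W)
    h = min(len(classes) - 1, d)          # at most h meaningful directions
    return lam[::-1][:h], A[:, ::-1][:, :h]

# usage (illustrative): lam, A = canonical_discriminants(X1, y1)
#                       Z = X1 @ A        # the canonical discriminant variates
```

Applied to sample1, such a routine reproduces the two CDVs of Figure 2 up to sign and scaling of the axes.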
4 Generalized Discriminant Analysis (GDA)

Baudat and Anouar [BAnou00] proposed a generalization of LDA to nonlinear problems and elaborated an algorithm called Generalized Discriminant Analysis (GDA). The main idea of GDA is to map the input space into a convenient feature space in which the variables are nonlinearly related to the input space. The algorithm maps the original input space into an extended, high-dimensional feature space with linear properties. The original nonlinear problem is then solved in the extended space in a classical way, by using LDA.

Generally, the mapping reads φ: X → F, where X is the input space (original data) and F is the extended feature space, usually of higher dimensionality than the original data space. The mapping transforms elements x ∈ X of the original data space into elements φ(x) ∈ F located in the feature space. The transformation is done using so-called Mercer kernels; Gaussian RBFs (Radial Basis Functions) and polynomials are representatives of such kernels. Mapping the original data in a nonlinear way into a high-dimensional feature space by means of Mercer kernels was originally applied in the domain of support vector machines (SVM), where the so-called 'kernel trick' is used (see e.g. [DHS01]): we never need to evaluate the values φ(x) explicitly; instead we seek a formulation of the algorithm which uses only the dot products k_φ(x, y) = (φ(x) · φ(y)), where x, y denote two data vectors, and the dot product k_φ(x, y) in the extended feature space F can be evaluated directly from the input vectors x, y ∈ X.

Baudat and Anouar [BAnou00] elaborated an algorithm (GDA) performing linear discriminant analysis in the extended feature space using only dot products from the input space. They implemented their algorithm in Matlab and made it openly accessible at http://www.kernel-machines.org/.

The GDA algorithm starts from computing the kernel matrix K_{N×N} = {k(i, j)}, i, j = 1, …, N, obtained in the following way. Let X_{N×d} denote the data matrix with row vectors x_i and x_j, and let d_ij = x_i − x_j. Then k(i, j) is computed as

k(i, j) = exp{ −(d_ij d_ij^T) / σ } = exp{ −||x_i − x_j||^2 / σ },

where the kernel width σ has to be declared by the user. The derived kernel matrix K_{N×N} is the basis for computing the GDA discriminant functions in F.

The results of applying the GDA algorithm to the erosion data, using Gaussian kernels with σ = 0.5, 0.05, 0.005 and 0.001, are shown in Figure 3. First the GDA functions were calculated, using the first part of the data (sample1) as the learning sample. Next, using the derived functions, we calculated the GDA variates for the same data set (i.e. sample1). These variates are displayed in Figure 3; the exhibits in the subsequent panels were obtained using the kernel widths σ = 0.5, 0.05, 0.005 and 0.001, respectively.
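To make the construction concrete, here is a compact NumPy sketch of a kernel discriminant analysis with the Gaussian kernel above. It follows the general recipe of [BAnou00] – an eigenproblem posed entirely in terms of the centred kernel matrix – but it is our own simplified illustration, not the authors' Matlab code; the ridge term and all names are our choices. The eigenvalues of this particular formulation are between/total scatter ratios, i.e. they play the role of the inertia index discussed below.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / sigma), as in Section 4."""
    return np.exp(-cdist(A, B, "sqeuclidean") / sigma)

def gda_fit(X, y, sigma, n_components=2, reg=1e-8):
    """Kernel discriminant coordinates learned on the training set (X, y).

    Returns the dual coefficients Alpha, the training kernel matrix K
    (needed later for projecting new points), the projected training
    variates Z, and the eigenvalues of the solved problem.
    """
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(N) - np.ones((N, N)) / N          # centring matrix
    Kc = H @ K @ H                               # kernel of centred features
    # block matrix coding class membership: W[i, j] = 1/n_c iff y_i = y_j = c
    W = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / idx.size
    # generalized symmetric eigenproblem (between vs total scatter in F);
    # a small ridge keeps the right-hand side positive definite
    vals, vecs = eigh(Kc @ W @ Kc, Kc @ Kc + reg * np.eye(N))
    order = np.argsort(vals)[::-1][:n_components]
    Alpha = vecs[:, order]
    Z = Kc @ Alpha                               # GDA variates of the training data
    return Alpha, K, Z, vals[order]

def gda_transform(X_new, X_train, K_train, Alpha, sigma):
    """Project new (test) points with the functions learned on X_train."""
    N = X_train.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    Kt = gaussian_kernel(X_new, X_train, sigma)
    # centre the test kernel rows consistently with the training centring
    Kt_c = (Kt - K_train.mean(axis=0)) @ H
    return Kt_c @ Alpha
```

In this illustration one would call, e.g., Alpha, K, Z1, vals = gda_fit(X1, y1, sigma=0.001) on sample1 and later Z2 = gda_transform(X2, X1, K, Alpha, sigma=0.001) to project the test sample, as is done in Section 4 for Figure 4.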
[Figure 3: four scatter plots of sample1 in the coordinates (1st GDA coordinate, 2nd GDA coordinate), one panel per kernel width σ = 0.5, 0.05, 0.005 and 0.001; classes: low, medium, high; regions marked LOW and HIGH.]

Fig. 3. GDA discriminant variates for four values of the Gaussian kernel width: σ = 0.5, 0.05, 0.005 and 0.001, respectively. Results obtained on the basis of the learning sample (sample1) counting n=1710 data vectors. Note the increasing differentiation among the groups progressing with the inverse of the kernel width. Both derived variates have a great discriminative power.

The discriminative power of the derived GDA variates is usually characterized by two indices [DHS01, BAnou00, Yang04, BAnou03]:
i) λ – the ratio of the between-class to the within-class scatter of the derived GDA variate; the larger λ, the better the discrimination;
ii) inertia – the ratio of the between-class to the total scatter of the derived GDA variate. By definition, 0 ≤ inertia ≤ 1; the closer to 1, the greater the concentration of the projected points belonging to the same class.

In Table 1 we show the inertia statistics obtained when using Gaussian kernels with kernel widths σ equal to 0.5, 0.05, 0.005 and 0.001. The values were obtained using the GDA software by Baudat & Anouar. Additionally we show the first 2 eigenvalues of the respective kernel matrices K – they tell us about the resolution of the GDA projections. A small computational sketch of both indices is given after Table 1.

Table 1. Inertia of the first two GDA variates evaluated from the learning set sample1 for 4 kernel widths σ (SIGMA), and the first 2 eigenvalues of the respective kernel matrix

  SIGMA    inertia1    inertia2    eigval1(K)   eigval2(K)
  0.5      0.965613    0.385237    308.0346     139.3207
  0.05     0.984177    0.810295    299.7271     249.1643
  0.005    0.997787    0.957456    101.4559      99.0073
  0.001    0.999970    0.997374     49.683       44.996
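Both indices are simple ratios of scatters of the projected points and are easy to evaluate once the variates are available. Below is a short self-contained NumPy sketch (our own illustration, with hypothetical names) computing λ and inertia for a single derived variate.

```python
import numpy as np

def lambda_and_inertia(z, y):
    """Discriminative power of one derived variate z (1-D array of projections).

    lambda  = between-class scatter / within-class scatter
    inertia = between-class scatter / total scatter   (0 <= inertia <= 1)
    """
    grand_mean = z.mean()
    between = 0.0
    within = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += zc.size * (zc.mean() - grand_mean) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    total = between + within            # = ((z - grand_mean) ** 2).sum()
    return between / within, between / total

# usage (illustrative): for each GDA variate Z[:, j] obtained on sample1,
#   lam, inertia = lambda_and_inertia(Z[:, j], y1)
```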
Looking at the plots exhibited in Figure 3 one may see how various values of the parameter σ affect the degree of separability of the classes. For very small σ we obtain a very good separability. Generally, when decreasing σ we improve the between-class separation; at the same time the projections belonging to one class become more and more concentrated. Thus the within-class scatter becomes smaller and smaller, which is indicated by the respective inertias.

This happens when considering the learning sample. What happens for the test sample, which is intended to show whether the derived GDA functions possess the ability to generalize? To find this out, we took the GDA functions obtained from sample1 and used them for projecting points both from sample1 and from sample2 (the test sample). The plots obtained for kernel width σ = 0.001 are shown in Figure 4.

Looking at the upper exhibit in Figure 4 and at the indices in Table 1 we state that sample1 (now plotted at a greater resolution) yields fairly well separated erosion classes. The projections of data vectors belonging to the high erosion class (depicted as circles) are concentrated in the bottom left corner; the medium erosion class points (squares) are concentrated in the upper left corner. The low erosion data points appear in the bottom right corner – they are the most concentrated. Concerning the test sample displayed in the bottom plot, it falls into the regions around the projections of the learning sample. However, the projections of the test sample are much more scattered than those of the learning sample. One may also note two wrongly classified points, indicated by arrows in the bottom exhibit: one point from the medium (MED) erosion class appears located in the HIGH erosion area; another point from the LOW erosion class appears isolated in the top right corner and seems not to belong to any of the 3 considered erosion classes.

[Figure 4: two scatter plots in the coordinates (1st GDA coordinate, 2nd GDA coordinate) for σ = 0.001; upper panel: sample1 (learning set); lower panel: sample2 (test set); class regions marked LOW, MED, HIGH.]

Fig. 4. GDA variates constructed with kernel width σ = 0.001, using sample1 for learning. Upper plot: projection of the data set sample1 onto the derived GDA variates. Bottom plot: projection of the test data set sample2 using the GDA functions derived from sample1. Note two seemingly wrongly allocated points-projections indicated by arrows: top right and bottom left.

5 Discussion and Closing Remarks

We think that the kernel approach is a fascinating and useful one. It can provide interesting insight into the data. The applied mathematical tool gives possibilities not imaginable when using the classical LDA. The derived canonical GDA functions, when used for visualization, yield a better differentiation between the erosion classes. Also, for c = 3 classes, they may yield more than 2 meaningful discriminants, which may serve for additional displays illustrating, e.g., contrasts between classes.

There are also some disadvantages: the difficulty of finding the parameters of the kernels (if any) and the lengthy calculations. E.g., for n = 1710 (sample1), using a PC under the MS Windows XP Home system with an Intel(R) Pentium(R) 4 at 1.80 GHz and 512 MB RAM, we needed about 12 minutes to obtain the mapping of the data for one value of the parameter σ.

We considered the following modifications (simplifications), sketched at the end of this section: 1) removing from the learning data identical or very similar data instances – we might that way obtain a reduction of the size of the data and speed up the calculations; 2) using for learning a training sample balanced in class sizes – which could yield a better generalization. We tried out these proposals with our data subdivided into 5 classes of erosion risk. Taking a set of 800 balanced representatives for learning we indeed obtained a speed-up of the calculations and a better generalization ability of the derived GDA functions. The results are not shown here.
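A minimal NumPy sketch of the two simplifications follows. It assumes a data matrix X and a label vector y; the rounding precision and the per-class quota are hypothetical choices, not the settings actually used in the 5-class experiment.

```python
import numpy as np

def thin_and_balance(X, y, decimals=2, per_class=160, seed=0):
    """1) drop (near-)duplicate rows by rounding, 2) draw a class-balanced subsample."""
    rng = np.random.default_rng(seed)
    # 1) near-duplicate removal: rows identical after rounding are kept only once
    _, keep = np.unique(np.round(X, decimals), axis=0, return_index=True)
    keep = np.sort(keep)
    X, y = X[keep], y[keep]
    # 2) balanced subsample: the same number of representatives from every class
    chosen = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        take = min(per_class, idx.size)
        chosen.append(rng.choice(idx, size=take, replace=False))
    chosen = np.concatenate(chosen)
    return X[chosen], y[chosen]

# usage (illustrative): X_small, y_small = thin_and_balance(X1, y1, per_class=160)
# with 5 classes, 160 representatives per class would give a balanced learning
# set of 800 points, as mentioned above.
```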
References

[AbaAsz00] Bartkowiak, A., Szustalewicz, A.: Two non-conventional methods for visualization of multivariate two-group data. Biocybernetics and Bioengineering 20/4, 5–20 (2000)
[BAnou00] Baudat, G., Anouar, F.: Generalized discriminant analysis using a kernel approach. Neural Computation 12, 2385–2404 (2000)
[BAnou03] Baudat, G., Anouar, F.: Feature vector selection and projection using kernels. Neurocomputing 55, 21–38 (2003)
[DHS01] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd Edition. Wiley (2001)
[GP83] Grassberger, P., Procaccia, I.: Measuring the strangeness of strange attractors. Physica D 9, 189–208 (1983)
[HTB97] Hastie, T., Tibshirani, R., Buja, A.: Flexible discriminant analysis. JASA 89, 1255–1270 (1994)
[Lach75] Lachenbruch, P.: Discriminant Analysis. Hafner Press (1975)
[RotSt00] Roth, V., Steinhage, V.: Nonlinear discriminant analysis using kernel functions. In: NIPS 12, 568–574. MIT Press (2000)
[Yang04] Yang, J., Jin, Z., et al.: Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition 37, 2097–2100 (2004)