Visualizing some multi-class erosion data using kernel methods

Anna Bartkowiak¹ and Niki Evelpidou²
¹ Institute of Computer Science, Wroclaw University, aba@ii.uni.wroc.pl
² Remote Sensing Laboratory, Geology Department, University of Athens, evelpidou@geol.uoa.gr
Summary. Using a given data set (the Kefallinia erosion data) with only 3 dimensions and a fractal correlation dimension r_GP ≈ 1.60, we wanted to see what the kernel methods really provide. We used Gaussian kernels with various kernel widths σ. In particular, we wanted to find out whether GDA (Generalized Discriminant Analysis), as proposed by Baudat and Anouar (2000), allows the high, medium and low erosion classes to be distinguished better than the classical Fisherian discriminant analysis does. The general result is that GDA yields discriminant variates permitting a better differentiation among the groups; however, the calculations are more lengthy.
Key words: Generalized discriminant analysis, Gaussian kernels, Effect of kernel
width, Visualization of canonical variates
1 Introduction
Classical canonical discriminant analysis, based on the Between and Within cross products [Lach75, DHS01], in fact uses only linear discriminant functions. A variety of other methods may be found in [HTB97, DHS01, AbaAsz00]. A generalization to non-linear discriminant analysis is provided by the kernel methods [BAnou00, Yang04, RotSt00].
Our concern is with kernel methods, in particular with the method called GDA (Generalized Discriminant Analysis) proposed by Baudat and Anouar [BAnou00]. In the case of 3 groups it makes it possible to construct more than two generalized canonical variates, which may serve for a more detailed visual display and analysis of the considered data.
In Section 2 we describe briefly the data. Section 3 visualizes the data using the classical LDA (Linear Discriminant Analysis). In Section 4 we present the corresponding results obtained by GDA (Generalized Discriminant Analysis) using various widths of the applied Gaussian kernels. Section 5 contains a summary and a discussion of the obtained results.
2 The Erosion Data
The data were gathered by a team from the Remote Sensing Laboratory, University of Athens (RSL-UOA) on the Greek island of Kefallinia. The entire island was covered by a grid containing 3422 cells. The area covered by each cell of the grid was characterized by several variables. For our purpose, to illustrate the visualization concepts, we will consider only 3 variables: drainage density, slope and vulnerability of the soil (rocks). The values of the variables were re-scaled (normalized) to belong to the interval [0, 1]. The data contained 2 very severe atypical observations. They were removed for the analysis presented hereafter, so as not to confound the outlier effect with the kernel width effect.
Thus, for our analysis, we got a data set containing N=3420 data vectors, each vector characterized by d = 3 variables. An expert GIS system, installed in the RSL-UOA, was used for assigning each data vector to one of 3 erosion classes: 1. high, 2. medium, 3. low. Let c = 3 denote the number of classes. A 3D plot of the data is shown in Figure 1.
[Figure 1: 3D scatter plot titled "3D Erosion data"; axes: drainage density, slope, vulnerability; classes marked high, medium, low.]
Fig. 1. The Kefallinia erosion data containing N=3420 data points (with 2 outliers
removed). The data set is subdivided into 3 classes of progressing erosion risk: low
(bottom), medium (in the middle), high (top). Notice the overlap of the medium
class with the low and high erosion classes.
Looking at the plot in Figure 1 one may state that, generally, the distribution of the data is far from normal and far from ellipsoidal in shape. Some parts of the space show a great concentration of data points, while other parts are sparsely populated. The Euclidean dimension of the data is d = 3; however the intrinsic dimension of the data, as measured by the Grassberger-Procaccia index [GP83], called the fractal correlation dimension, is only ≈ 1.60. Therefore we feel justified in representing the data in a plane.
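For readers who wish to reproduce such an intrinsic-dimension estimate, the sketch below computes the Grassberger-Procaccia correlation integral C(r) and reads off the correlation dimension as the slope of log C(r) versus log r. It is a minimal illustration, not the authors' original computation; the stand-in data, the chosen radii and all names are our own assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Grassberger-Procaccia estimate: slope of log C(r) versus log r,
    where C(r) is the fraction of point pairs closer than r."""
    d = pdist(X)                                  # all pairwise Euclidean distances
    C = np.array([(d < r).sum() / d.size for r in radii])
    mask = C > 0                                  # keep radii with at least one pair
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope

# Example with synthetic data standing in for the N x 3 erosion matrix:
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
radii = np.logspace(-2, -0.5, 15)                 # scaling region chosen by inspection
print(f"estimated correlation dimension: {correlation_dimension(X, radii):.2f}")
```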
For our analysis we have subdivided our data set into two parts of equal size n
= 1710. The first part – called in the following sample1 – is for learning; the second
part – sample2 – is for testing.
In the next section we will find 2 canonical discriminant variates containing most
of the discriminative power of the analyzed data set.
3 Classical Canonical Discriminant Analysis (LDA)
The canonical discriminant functions are derived from Fisher's criterion. The problem is to find the linear combination a of the variables which separates the considered classes as much as possible. The criterion of separateness, proposed by Fisher, is the ratio of the between-class to the within-class scatter (variance) (see [Lach75, DHS01]). The ratio is here denoted as λ. Big values of λ indicate a good separation between classes. To be meaningful, the ratio λ should be greater than 1.0; the greater the value, the better the discrimination.
For given data, we may obtain at most h = min{c−1, d} linear combinations as solutions of the stated problem (c denotes the number of classes and d the number of variables). These h linear combinations are called canonical discriminant functions (CDFs). They are used for projecting the data points into the canonical discriminant space. The derived projections are called canonical discriminant variates (CDVs). For our erosion data we got h = 2 CDFs; they yielded 2 CDVs.
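Fisher's criterion leads to a generalized eigenproblem: the CDFs are the eigenvectors a of S_B a = λ S_W a, and the eigenvalues are exactly the λ ratios mentioned above. The sketch below is our own illustration of this step, not the software actually used for the paper; the array names X and labels are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_discriminants(X, labels):
    """Return the lambda ratios and the canonical discriminant functions a
    solving the generalized eigenproblem S_B a = lambda * S_W a."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)            # pooled within-class scatter
        diff = (mc - mean_all).reshape(-1, 1)
        S_B += Xc.shape[0] * diff @ diff.T        # between-class scatter
    lam, A = eigh(S_B, S_W)                       # generalized symmetric eigenproblem
    order = np.argsort(lam)[::-1]                 # largest ratios first
    return lam[order], A[:, order]

# CDVs: project the data on the leading h = min(c-1, d) functions
# lam, A = canonical_discriminants(X, labels)
# cdv = X @ A[:, :2]
```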
In Figure 2 we show the data points displayed in the coordinate system of the obtained CDVs. The left panel shows the display of the entire data set; the right panel, the display of the learning set sample1.
[Figure 2: two scatter plots in the coordinates Canonical Variate no.1 versus Canonical Variate no.2; left panel titled "CanDiscr lambda=20.4623 d=3" (entire data set), right panel titled "CanD sample n=1710 λ=19.9 d=3" (sample1); classes marked high, med, low, with LOW and HIGH regions annotated.]
Fig. 2. Projections of the data using canonical discriminant functions. Left: Entire data set, N=3420. Right: Halved data set (sample1), n=1710. The first canonical variate has high discriminative power: λ1 ≈ 20; the second canonical variate has no discriminative power: λ2 < 0.01
The discriminative power of the derived canonical variates (CDVs) is indicated by the magnitude of the corresponding values of their λ statistics [Lach75, DHS01]. For our data the first CDV has a very big discriminative power (λ1 ≈ 20.0), while the second one has no discriminative power (λ2 < 0.01). However, we also use the second CDV, because it is helpful in perceiving the differentiation between the 3 classes.
As may be seen in Figure 2, the separation of the groups is not ideal: a noticeable number of low and high erosion data points overlap with the medium erosion class. The direction of the 1st CDV reflects the increase of the erosion risk: from very small (left) to very high (right).
4 Generalized Discriminant Analysis (GDA)
Baudat and Anouar [BAnou00] proposed a generalization of the LDA to nonlinear problems and elaborated an algorithm called Generalized Discriminant Analysis
(GDA). The main idea of GDA is to map the input space into a convenient feature
space in which variables are nonlinearly related to the input space. The algorithm
maps the original input space into an extended high dimensional feature space with
linear properties. The original nonlinear problem is solved in the extended space in
a classical way – by using the LDA.
Generally, the mapping reads: X → F, where X is the input space (original data) and F is the extended feature space, usually of higher dimensionality than the original data space. The mapping transforms elements x ∈ X from the original data space into elements φ(x) ∈ F located in the feature space. The transformation is done using so-called Mercer kernels; Gaussian RBFs (Radial Basis Functions) and polynomials are representatives of such kernels.
The idea of mapping the original data in a nonlinear way into a high-dimensional feature space – by using Mercer kernels – was originally applied in the domain of support vector machines (SVM), where the so-called 'kernel trick' was applied (see e.g. [DHS01]): we do not need to evaluate the values φ(x) explicitly; instead we seek a formulation of the algorithm which uses only the dot products k_φ(x, y) = (φ(x) · φ(y)), with x, y denoting two data vectors, where the dot product k_φ(x, y) in the extended feature space F can be evaluated directly from the input vectors x, y ∈ X.
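As a tiny worked example of the kernel trick (a textbook illustration, not part of the erosion analysis), the snippet below checks numerically that the degree-2 polynomial kernel (x · y + 1)² equals the ordinary dot product of an explicit degree-2 feature map φ:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D input vector."""
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1**2, v2**2,
                     np.sqrt(2) * v1 * v2])

x = np.array([0.3, 0.7])
y = np.array([0.5, 0.1])

k_direct = (x @ y + 1.0) ** 2            # kernel evaluated from the inputs only
k_feature = phi(x) @ phi(y)              # same dot product in the feature space
print(np.isclose(k_direct, k_feature))   # True
```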
Baudat and Anouar [BAnou00] elaborated an algorithm (GDA) performing linear discriminant analysis in the extended feature space, using only dot products from the input space. They implemented their algorithm in Matlab and made it openly accessible at http://www.kernel-machines.org/.
The GDA algorithm starts by computing the kernel matrix K_{N×N} = {k(i, j)}, i, j = 1, …, N, obtained in the following way: Let X_{N×d} denote the data matrix, with row vectors x_i and x_j. Let d_ij = x_i − x_j. Then k(i, j) is computed as k(i, j) = exp{−(d_ij d_ij^T)/σ}, where the kernel width σ has to be declared by the user. The derived kernel matrix K_{N×N} is the basis for computing the GDA discriminant functions in F.
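A minimal sketch of this kernel-matrix computation (with the σ convention of the formula above, i.e. no factor 2 in the denominator) might look as follows; X_sample1 is a hypothetical name for the learning-sample data matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / sigma), as in the formula above."""
    sq_dists = squareform(pdist(X, metric="sqeuclidean"))
    return np.exp(-sq_dists / sigma)

# K_sample1 = gaussian_kernel_matrix(X_sample1, sigma=0.5)   # N x N kernel matrix
```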
The results of applying the GDA algorithm to the erosion data – when using
Gaussian kernels – with σ = 0.5, 0.05, 0.005, 0.001 are shown in Figure 3.
First, the GDA functions were calculated using as learning sample the first part of the data (sample1). Next, using the derived functions, we calculated the GDA variates for the same data set (i.e. sample1). These variates are displayed in Figure 3; the exhibits in subsequent panels were obtained using the kernel widths σ = 0.5, 0.05, 0.005, 0.001, respectively.
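The Baudat-Anouar algorithm itself works with an eigendecomposition of the kernel matrix; we do not reproduce their Matlab code here. As pointed out in [Yang04], the kernel Fisher discriminant is in essence kernel PCA followed by classical LDA, so the following hedged sketch takes that route: centre the kernel matrix, extract kernel principal component scores of the learning sample, and run LDA on the scores. Function and variable names (K_sample1, labels_sample1) are our own assumptions, not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_pca_scores(K, n_components=None, tol=1e-10):
    """Scores of the learning sample on the kernel principal components.
    K is the (uncentred) N x N kernel matrix of the learning sample."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # centring in feature space
    vals, vecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    keep = vals > tol                            # drop numerically null directions
    vals, vecs = vals[keep][::-1], vecs[:, keep][:, ::-1]
    if n_components is not None:
        vals, vecs = vals[:n_components], vecs[:, :n_components]
    return vecs * np.sqrt(vals)                  # score matrix Z of the training points

def gda_like_variates(Z, labels, n_keep=2):
    """Classical LDA on the KPCA scores; returns the leading discriminant variates."""
    d = Z.shape[1]
    m = Z.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Zc = Z[labels == c]
        mc = Zc.mean(axis=0)
        S_W += (Zc - mc).T @ (Zc - mc)
        diff = (mc - m).reshape(-1, 1)
        S_B += Zc.shape[0] * diff @ diff.T
    lam, A = eigh(S_B, S_W + 1e-8 * np.eye(d))   # small ridge keeps S_W invertible
    order = np.argsort(lam)[::-1][:n_keep]
    return Z @ A[:, order]

# variates = gda_like_variates(kernel_pca_scores(K_sample1, n_components=50),
#                              labels_sample1)
```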
[Figure 3: four scatter plots of the first versus the second GDA coordinate, obtained for the learning sample (SAMP1) with Gaussian kernel widths sigma = 0.5, 0.05, 0.005 and 0.001; classes marked low, medium, high, with LOW and HIGH regions annotated.]

Fig. 3. GDA discriminant variates for four values of the Gaussian kernel width: sigma = 0.5, 0.05, 0.005 and 0.001, respectively. Results obtained on the basis of the learning sample (sample1) containing n=1710 data vectors. Note the increasing differentiation among the groups progressing with the inverse of the kernel width. Both derived variates have a great discriminative power.

The discriminative power of the derived GDA variates is usually characterized by two indices [DHS01, BAnou00, Yang04, BAnou03]:
i) λ – the ratio of the between-class to the within-class scatter of the derived GDA variates; the larger the λ, the better the discrimination;
ii) inertia – the ratio of the between-class to the total scatter of the derived GDA variate. By definition, 0 ≤ inertia ≤ 1; the closer to 1, the greater the concentration of projected points belonging to the same class.
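A minimal sketch of how these two indices can be computed for a single derived variate (our reading of the definitions above, not the Baudat-Anouar software) is given below; y holds the projections of the learning sample onto one GDA variate and labels the class assignments:

```python
import numpy as np

def lambda_and_inertia(y, labels):
    """lambda = between / within scatter; inertia = between / total scatter,
    for one derived variate y (a 1-D array of projections)."""
    m = y.mean()
    total = ((y - m) ** 2).sum()
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        yc = y[labels == c]
        between += yc.size * (yc.mean() - m) ** 2
        within += ((yc - yc.mean()) ** 2).sum()
    return between / within, between / total

# lam1, inertia1 = lambda_and_inertia(variates[:, 0], labels_sample1)
```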
Below we show the inertia statistics obtained when using Gaussian kernels with kernel width σ equal to 0.5, 0.05, 0.005 and 0.001. The values were obtained using the GDA software by Baudat & Anouar. Additionally we show the first 2 eigenvalues
of the respective kernel matrices K – they tell us about the resolution of the GDA
projections.
Table 1. Inertia of first two GDA variates evaluated from the learning set sample1
for 4 kernel widths σ (SIGMA) and first 2 eigenvalues of the respective kernel matrix
SIGMA    inertia1    inertia2    eigval1(K)    eigval2(K)
0.5      0.965613    0.385237    308.0346      139.3207
0.05     0.984177    0.810295    299.7271      249.1643
0.005    0.997787    0.957456    101.4559       99.0073
0.001    0.999970    0.997374     49.683        44.996
Looking at the plots exhibited in Figure 3 one may see how various values of the parameter σ (SIGMA) affect the degree of separability of the classes. For very small σ we have a very good separability. Generally, when decreasing σ we improve the between-class separation; at the same time the projections from one class become more and more concentrated. Thus the within-class scatter becomes smaller and smaller, which is indicated by the respective inertias. This happens for the learning sample. What happens for the test sample, which is intended to show whether the derived GDA functions possess the ability to generalize?
To find this out, we took the GDA functions obtained from sample1 and used them for projecting points from both sample1 and sample2 (the test sample). The plots, obtained for kernel width σ = 0.001, are shown in Figure 4.
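In kernel discriminant methods the discriminant functions are expansions over the learning sample, so a new point is projected through its kernel values against the training vectors. The sketch below assumes that the training step returned a coefficient matrix alpha (one column per variate, a hypothetical name) and ignores any centring constants for brevity:

```python
import numpy as np
from scipy.spatial.distance import cdist

def project_new_points(X_new, X_train, alpha, sigma):
    """Project test vectors onto kernel discriminant variates.
    alpha holds the expansion coefficients over the training sample
    (one column per variate), as produced by the training step."""
    sq_dists = cdist(X_new, X_train, metric="sqeuclidean")
    K_new = np.exp(-sq_dists / sigma)      # kernel values against sample1
    return K_new @ alpha                   # coordinates for plotting

# variates_test = project_new_points(X_sample2, X_sample1, alpha, sigma=0.001)
```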
Looking at the upper exhibit in Figure 4 and at the indices in Table 1 we may state that sample1 (plot depicted now at greater resolution) yields erosion classes that are fairly well separated. The projections of data vectors belonging to the high erosion class (depicted as circles) are concentrated in the bottom left corner; the medium erosion class points (squares) are concentrated in the upper left corner. The low erosion data points appear in the bottom right corner – they are the most concentrated.
As concerns the test sample displayed in the bottom plot, it falls in the regions around the projections of the learning sample. However, the projections of the test sample are much more scattered than those from the learning group. One may also note two wrongly classified points indicated (in the bottom exhibit) by arrows: one point from the medium (MED) erosion class appears located in the HIGH erosion area; the other point, from the LOW erosion class, appears isolated in the top right corner and seems not to belong to any of the 3 considered erosion classes.
[Figure 4: two scatter plots of the first versus the second GDA coordinate, obtained with sigma = 0.001; upper panel titled "GDA from SAMP1 sigma=0.001" (learning sample), lower panel titled "GDA from SAMP1 sigma=0.001 TEST" (test sample); class regions marked LOW, MED, HIGH.]

Fig. 4. GDA variates constructed for kernel width σ = 0.001, using sample1 for learning. Upper plot: projection of the data set sample1 onto the derived GDA variates. Bottom plot: projection of the test data set sample2 when using the GDA functions derived from sample1. Note two seemingly wrongly allocated points-projections indicated by arrows: top right and bottom left.

5 Discussion and Closing Remarks
We think that the kernel approach is a fascinating and useful approach. It can provide interesting insight into the data. The applied mathematical tool gives possibilities not imaginable when using the classical LDA. The derived canonical GDA functions, when used for visualization, yield a better differentiation between the erosion classes. Also, for c = 3 classes, they may yield more than 2 meaningful discriminants, which may serve for additional displays illustrating, e.g., contrasts between classes.
There are also some disadvantages: the difficulty of finding the parameters of the kernels (if any) and the lengthy calculations. For example, for n = 1710 (sample1), using a PC under MS Windows XP Home with an Intel(R) Pentium(R) 4 1.80 GHz processor and 512 MB RAM, we needed about 12 minutes to obtain the mapping of the data for one value of the parameter SIGMA.
We considered the following modifications (simplifications): 1) Removing from the learning data identical or very similar data instances – in that way we might reduce the size of the data and speed up the calculations. 2) Using for learning a training sample balanced in class sizes – which could yield a better generalization.
We have tried out these proposals with our data subdivided into 5 classes of erosion risk. Taking a set of 800 balanced representatives for learning, we indeed got a speed-up of the calculations and a better generalization ability of the derived GDA functions. The results are not shown here.
References
[AbaAsz00] Bartkowiak, A., Szustalewicz, A.: Two non-conventional methods for visualization of multivariate two-group data. Biocybernetics and Bioengineering 20/4, 5–20 (2000)
[BAnou00] Baudat, G., Anouar, F.: Generalized discriminant analysis using a kernel approach. Neural Computation 12, 2385–2404 (2000)
[BAnou03] Baudat, G., Anouar, F.: Feature vector selection and projection using kernels. Neurocomputing 55, 21–38 (2003)
[DHS01] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd Edition. Wiley (2001)
[GP83] Grassberger, P., Procaccia, I.: Measuring the strangeness of strange attractors. Physica D 9, 189–208 (1983)
[HTB97] Hastie, T., Tibshirani, R., Buja, A.: Flexible discriminant analysis. JASA 89, 1255–1270 (1994)
[Lach75] Lachenbruch, P.: Discriminant Analysis. Hafner Press (1975)
[RotSt00] Roth, V., Steinhage, V.: Nonlinear discriminant analysis using kernel functions. NIPS 12, 568–574, MIT Press (2000)
[Yang04] Yang, J., Jin, Z., et al.: Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition 37, 2097–2100 (2004)