Fast Hyperspectral Feature Reduction Using Piecewise Constant Function Approximations

Are C. Jensen, Student member, IEEE and Anne Schistad Solberg, Member, IEEE
Abstract— The high number of spectral bands obtained from hyperspectral sensors, combined with the often limited ground truth, calls for some kind of feature reduction when attempting supervised classification. This paper demonstrates that an optimal piecewise constant function representation of hyperspectral signature curves in the mean square sense is capable of representing the data sufficiently well to outperform, or match, other feature reduction methods like PCA, SFS and DBFE for classification purposes on all of the four hyperspectral datasets that we have tested. The simple averaging of spectral bands makes the resulting features directly interpretable in a physical sense. Using an efficient dynamic programming algorithm, the proposed method can be considered fast.
Index Terms— Pattern classification, Remote sensing, Feature
extraction
I. INTRODUCTION
Hyperspectral sensors provide a rich source of
information to allow an accurate separation of land cover
classes. Often several hundred spectral samples are acquired
for every pixel. Unfortunately, the number of pixels available
for training the classifiers is often severely limited, and in
combination with the high number of spectral bands, the
occurrence of the so-called Hughes phenomenon is almost
inevitable. Furthermore, the spectral samples often exhibit high
correlation adding a redundancy that may obscure the information important for classification. To alleviate these problems
one can reduce the dimensionality by feature reduction, or try
to regularize parameters by biasing them towards simpler and
more stable estimates. This paper focuses on dimensionality
reduction.
Classical techniques for feature reduction in the pattern recognition literature can be applied by considering the spectral samples as features. These include feature selection algorithms like sequential forward selection (SFS), forward-backward selection, floating search and the more recently proposed fast constrained search algorithm [1][2][3], and linear
transformations like the principal components transform (PCA),
Fisher's canonical transform, decision boundary feature extraction (DBFE) and the discrete wavelet transform [1][4][5].
The linear transforms have a disadvantage over the selection
methods in that direct interpretation of the resulting features
is difficult, i.e., the new features are linear combinations of
all the spectral bands. On the other hand, selecting features
when the number of possible feature candidates is large and
the number of training samples is limited, can lead to incorrect
and unstable selections [6].
This work was supported by the Norwegian Research Council.
A. C. Jensen and A. S. Solberg are with the University of Oslo.
The method presented in this paper divides the spectral
curves into contiguous regions by piecewise constant function
approximations. The extracted constants are then used as new
features. The assumption is that the continuous curves that
the features are samples of can be approximated sufficiently
by such functions to allow subsequent classification. By using
piecewise constants instead of higher order polynomials, the
number of resulting features is minimized and the features
are simple averages of contiguous spectral bands allowing
straightforward interpretation. This gives a compromise between the generality of linear transforms and the interpretability of the feature selection techniques.
Averages of contiguous spectral bands are also the end result
of the top-down generalized local discriminant bases (TD-GLDB) procedure in [7]. Their algorithm recursively partitions
the spectral curve into two sets of bands and replaces each final
set of bands by its mean value. The criterion for selecting
the partitioning breakpoint is either the performance on a
validation set or an estimate of the differences in class
probability densities. The same paper also proposes a more
computationally intensive strategy in which the simple averaging
used to create features is replaced by Fisher's canonical transform,
combined with bottom-up search strategies. Both methods
rely on a pairwise classification framework.
Another approach based on contiguous band averages is
that of [8], which uses adaptations of selection methods to
extract non-overlapping spectral regions. Instead of iteratively
optimizing a classification criterion as in the latter approach,
the proposed method focuses solely on minimizing the representation error when finding the breakpoints between the
regions to average.
Replacing the piecewise constants by linear functions and
minimizing the representational error for the class mean curves
using a greedy recursive method has been applied in [9].
In Section II it is shown that this method, with the addition
of finding globally optimal breakpoints, can be seen as a
special case of the proposed optimization framework when
considering piecewise linear instead of piecewise constant
segments.
This paper shows that the fairly simple and direct approach
proposed outperforms, or matches, methods like SFS, PCA,
DBFE and TD-GLDB on all of the four hyperspectral data
sets that we have tested.
II. OPTIMAL PIECEWISE CONSTANT REPRESENTATION
The goal is to partition the hyperspectral signatures into a
fixed number of contiguous intervals with constant intensities
minimizing the mean square representation error. Let S_{ij} be
the jth feature of the ith pixel of a dataset with a total of N
pixels, each with M features. S = {S_{ij} | 1 ≤ i ≤ N, 1 ≤ j ≤ M}
is thus the collection of all spectral curves (the entire training
dataset) available regardless of class membership. We seek
a set of K breakpoints, P = {p_1, p_2, ..., p_K}, which define
the contiguous intervals I_k = [p_k, p_{k+1}). Note that the K
breakpoints are equal for all the classes. In our model each
interval, for every pixel, is represented by a constant µ_{ik} ∈ ℝ.
The square representation error of the model is thus
    H = \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{j \in I_k} (S_{ij} - \mu_{ik})^2 .    (1)
If the breakpoints P are given, one ends up with a simple
square function w.r.t. the constants µ_{ik}, so the minimizer of
(1) w.r.t. µ_{ik} is given by, letting |I_k| denote the number of
elements in the kth interval,
    \mu_{ik} = \frac{1}{|I_k|} \sum_{j \in I_k} S_{ij} ,    (2)
i.e., the mean value of each pixel’s interval between breakpoints. What is left to be determined to minimize (1) is thus
the locations of the breakpoints. These breakpoints can be
found using dynamic programming. The algorithm presented
here is based on [10], with the extension of allowing multiple
spectral curves. A more detailed exposition of the dynamic
programming can be found in the mentioned reference. The
following gives the minimum of (1) for all K.
Define

    B(K, m) = \min_{P, \theta, |P| = K} H_m    (3)

where H_m means (1) where only the first m features are used, |P| = K means the use of K breakpoints, and θ is the collection of constants. For M ≥ m ≥ K > 1 we have, since (3) is decreasing with increasing K, the recurrence relation

    B(K, m) = \min_{K-1 \le r \le m-1} \left[ B(K-1, r) + D_{[r+1, m]} \right]    (4)

where

    D_{[r, m]} = \sum_{i=1}^{N} \min_{\mu} \sum_{j=r}^{m} (S_{ij} - \mu)^2 .    (5)

In short, we end up with Algorithm 1.

Algorithm 1 Finding the minimum of (1) for all K.
1. Set B(1, m) = D_{[1, m]}, 1 ≤ m ≤ M.
2. Determine B(K, m), 1 < K ≤ m ≤ M, from (4), keeping track of the r*_{K,m} giving its minimum.
3. Determine the breakpoints recursively from the minimizers r*, for B(K, M), B(K-1, r*_{K,M}), ...
After finding the breakpoints, the constants, µ_{ik}, from (2) are applied as features in the classification process. An example of a piecewise constant function approximation with K = 20 is shown in Figure 1. Although the complexity of the algorithm is O(NM^3), N being the number of pixels and M the number of spectral bands, the inner loop of the algorithm is fast. Examples of execution times can be found in Section IV.

[Fig. 1. Example of piecewise constant function approximation from the KSC dataset. The number of segments (K+1) is 21.]

The procedure can be perceived as finding an orthogonal linear transform minimizing the square representational error, with the additional constraint that each row in the resulting matrix satisfies the criterion of representing the mean of a single curve segment. The extracted features can thus subsequently be obtained by a simple matrix multiplication.

A. Possible extensions of the approach
To balance possibly unequal numbers of samples in the different classes, and to allow a priori class weighting, (1) can be rewritten as

    H = \sum_{k=1}^{K} \sum_{i=1}^{N} \frac{p(i)}{n(i)} \sum_{j \in I_k} (S_{ij} - \mu_{ik})^2    (6)

where p(i) and n(i) are the a priori likelihood and the number of pixels of the class of the ith pixel, respectively. After a similar alteration of (5), Algorithm 1 is directly applicable.
Replacing the simple piecewise constant function model in (1) with higher order polynomials is straightforward, but by using, e.g., a linear model, the number of features for a given set of breakpoints is doubled. This increase in features would most often require subsequent feature selection when applied in a classification setting. However, the method still produces a linear transform, although not an orthogonal one.
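Returning briefly to the class-weighted criterion (6): it only changes how the per-pixel interval errors of (5) are accumulated, so the same dynamic program applies with a weighted D. A minimal sketch under the same assumptions as above; the label and prior inputs are illustrative.

    import numpy as np

    def weighted_interval_error(S, labels, priors):
        # Returns a function D_w(a, b): the analogue of (5) in which pixel i's
        # squared error is weighted by p(i)/n(i), as in (6).
        classes, counts = np.unique(labels, return_counts=True)
        n_of = dict(zip(classes, counts))
        w = np.array([priors[c] / n_of[c] for c in labels])   # p(i) / n(i) per pixel
        cs = np.zeros((S.shape[0], S.shape[1] + 1))
        cs[:, 1:] = np.cumsum(S, axis=1)
        cs2 = np.zeros((S.shape[0], S.shape[1] + 1))
        cs2[:, 1:] = np.cumsum(S ** 2, axis=1)

        def D_w(a, b):
            n = b - a + 1
            s = cs[:, b + 1] - cs[:, a]
            s2 = cs2[:, b + 1] - cs2[:, a]
            return float(np.sum(w * (s2 - s * s / n)))

        return D_w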
B. Summarizing the approach
To summarize the approach, training consists of determining the K breakpoints and the corresponding segment means minimizing the square representation error. The extracted parameters are then used in a regular classifier, in our case a Gaussian Maximum Likelihood classifier.
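As noted earlier, the segment averaging amounts to a linear transform, so once the breakpoints are fixed, feature extraction is one matrix multiplication. A small sketch of that view, with names of our own choosing:

    import numpy as np

    def averaging_transform(starts, n_bands):
        # Builds the (K x M) matrix whose kth row averages the bands of segment k,
        # so that the reduced features are obtained as S @ T.T.
        edges = list(starts) + [n_bands]
        T = np.zeros((len(starts), n_bands))
        for k, (a, b) in enumerate(zip(edges[:-1], edges[1:])):
            T[k, a:b] = 1.0 / (b - a)
        return T

    # Example: X_train = S_train @ averaging_transform(starts, S_train.shape[1]).T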
III. EXPERIMENTS
A. The Datasets
To give a thorough analysis of the performance of the
proposed method, it has been applied to four hyperspectral
datasets containing widely different types of data, and with
dimensions ranging from 81 to 176 bands. The first dataset,
ROSIS, is from an airborne sensor, contains forest type data,
is divided into three classes, has 81 spectral bands and has
a pixel size of 5.6m. The second dataset, DC [11], is from
an airborne sensor, contains urban type data, is divided into
five classes, and has 150 bands. The third set, KSC [12], is
from an airborne sensor (AVIRIS), contains vegetation type
data, divided into 13 classes, has 176 spectral bands and has
18m pixels. The last dataset, BOTSWANA [12], is from a
scene taken by the Hyperion sensor aboard the EO-1 satellite,
contains vegetation type data, is divided into 14 classes, has
145 bands, and has a 30m pixel size. The average number
of training pixels per class is 700, 600, 196 and 115 for the
respective data sets.
B. Methods
To evaluate the performance of the different methods we
applied the standard approach of separating the available
ground-truthed data into roughly equally sized regions for
training and testing, and report performance on the test data.
As far as possible, we have made sure that all regions for
training are spatially disjoint from the regions for testing,
to avoid training on neighboring pixels that are correlated
with test data. For the two datasets with sufficient numbers of pixels, 5 repeated experiments were designed by randomly sampling equally sized sets for each class from the half of the available data marked for training the classifier.
The proposed method is compared with the principal components transform (PCA), sequential forward selection (SFS)
using the sum of the estimated Mahalanobis distances as the
feature evaluation function, decision boundary feature extraction (DBFE) using the leave-one-out covariance matrix estimate
[13], and top-down generalized local discriminant bases (TD-GLDB) [7] using their suggested log-odds probabilities criterion. Simple majority voting was used in the TD-GLDB
pairwise framework, and results using the implementation in
[14] are reported. The PCA is fast, widely used, and is based on
the signal representation error, while the DBFE is more elaborate
with the inclusion of class separability in the optimization process. The TD-GLDB is based on averaging of adjacent bands,
and thus has similarities to the proposed method, even though
it requires a pairwise classification setting. More sophisticated
methods for feature selection, e.g. forward-backward selection
and floating selection [1], have been omitted because of their
excessive execution times, which make a comparison with the
other, faster methods inappropriate. The extracted
features are used in a Gaussian Maximum Likelihood (GML)
classifier.
The number of features, e.g. the number of principal components in PCA or the number of contiguous intervals in the
proposed method, was chosen using standard 10-fold cross-validation on each of the sampled training sets. The average
performance and stability reported are based on the test data in
these 5 experiments.
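As an illustration of this model-selection step, the following sketch picks the number of segments by 10-fold cross-validation. It reuses the breakpoint and feature functions sketched in Section II and uses scikit-learn's QuadraticDiscriminantAnalysis as a stand-in for the Gaussian Maximum Likelihood classifier; both the library choice and the names are our assumptions, not the authors' code.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    def choose_n_segments(S, y, candidates, n_folds=10, seed=0):
        # Returns the candidate segment count with the highest mean cross-validated
        # accuracy, together with the per-candidate scores.
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        mean_acc = {}
        for k in candidates:
            accs = []
            for train_idx, val_idx in cv.split(S, y):
                # Breakpoints are fitted on the training fold only (Section II sketch).
                starts, _ = piecewise_constant_breakpoints(S[train_idx], k)
                clf = QuadraticDiscriminantAnalysis()
                clf.fit(segment_means(S[train_idx], starts), y[train_idx])
                accs.append(clf.score(segment_means(S[val_idx], starts), y[val_idx]))
            mean_acc[k] = float(np.mean(accs))
        return max(mean_acc, key=mean_acc.get), mean_acc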
Due to lack of knowledge about the true priors of the
ground truth classes, the average performance over all classes
is reported in the experiments. This is the case for all the data
sets.
Experiments using the proposed method on data where the
mean value for each feature has been set to zero have also been
conducted. This was done to separate the general structure
of the spectral curve from the variance found in the data.
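For reference, this band-mean removal is a one-line preprocessing step; estimating the mean on the training data only is our assumption, as the paper does not specify it.

    import numpy as np

    def remove_band_means(S_train, S_test):
        # Sets the mean of every spectral band to zero, estimated on the training data.
        band_mean = S_train.mean(axis=0)
        return S_train - band_mean, S_test - band_mean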
IV. RESULTS
Table I shows that the overall misclassification rate when
the optimal number of features is chosen is lower using
the proposed method on three out of four datasets. The
sequential forward selection (SFS) method has a slightly better
performance on the DC dataset, but the proposed method still
outperforms the PCA, DBFE and TD-GLDB approaches.
Except for the TD-GLDB, the misclassification rates of the
methods can, in a natural way, be visualized as functions
of the number of features used. Figures 2 and 3
are representative examples showing the performance as such
functions. On the BOTSWANA dataset, the proposed method
and PCA have quite similar performance, but the proposed
method has a slightly sharper drop in misclassification rate
in the first part of the graph and a lower increase in
misclassification as the number of features increases beyond the optimal
point. The KSC results shown in Figure 3 show a distinctly
lower misclassification rate for the proposed method for large
intervals of feature numbers.
When the mean curve of the data has been subtracted, the
proposed method has a small drop in performance on three
out of four datasets, but is still comparable to or better than the
other methods. The variances of the misclassification rate, i.e.,
the instabilities of the classifiers, are lower for the proposed
method, with an increase when subtracting the mean. Even
though the estimated placements of the breakpoints in the
proposed method are quite stable across the different trials for
each dataset, the chosen number of regions varies, as is seen
in Table II. Note that although the chosen number of regions
varies with the experiments, the error rate remains fairly stable.
Algorithm runtimes are shown in Table III. The time it takes
to find the possible features using the proposed method is
merely a fraction of the time it takes to do the SFSs, and
considerably lower than for the DBFEs, even with the non-optimized Matlab version of the proposed method used in
this study. The execution times for the pairwise strategy of
TD-GLDB increase drastically when applied to the datasets
with many classes. Of course, the PCA only needs to solve
an eigenvalue problem, and takes about a second.
TABLE I
Test results with the optimal number of features chosen by cross-validation for the four datasets using PCA, SFS, DBFE and the proposed method. Mean misclassification rate and one standard deviation in percent.

Method            ROSIS          DC            KSC      BOTSWANA
PCA               14.79 ±0.25    3.20 ±0.21    12.22    4.45
SFS               14.93 ±0.30    2.85 ±0.14    12.49    5.28
DBFE              16.44 ±0.65    3.89 ±0.20    13.33    6.20
TD-GLDB           20.28 ±0.41    6.21 ±0.42    17.18    7.41
Proposed          14.18 ±0.17    2.96 ±0.05    10.16    3.74
Proposed (µ=0)    14.90 ±0.18    3.03 ±0.15    10.39    3.29

TABLE II
The mean number of features and one standard deviation for the different feature reduction approaches.

Method            ROSIS          DC            KSC     BOTSWANA
PCA               17.2 ±3.8      73.4 ±27      14      14
SFS               19.6 ±2.9      72.8 ±12      18      16
DBFE              36.2 ±10.6     141 ±5.5      11      20
TD-GLDB (a)       6.9 ±0.37      6.8 ±0.67     8.6     8.6
Proposed          20.0 ±1.4      64.8 ±20      21      12
Proposed (µ=0)    21.2 ±1.6      57.8 ±22      19      20
(a) The means of the pairwise classifiers are reported.

TABLE III
Execution times in seconds to find all feature sets ranging from one to full dimension.

Method          ROSIS    DC      KSC     BOTSWANA
SFS             164      1255    2916    1509
DBFE            21       207     420     295
TD-GLDB (a)     35       226     2273    2239
Proposed        19       178     239     85
(a) TD-GLDB finds final features directly.

[Fig. 2. Misclassification rate for different numbers of features using PCA, SFS and the proposed method on the BOTSWANA dataset. Cross-validation estimated 12, 14, 16, and 20 features for the proposed method, PCA, SFS and DBFE, respectively.]

[Fig. 3. Misclassification rate for different numbers of features using PCA, SFS and the proposed method on the KSC dataset. Cross-validation estimated 21, 14, 18, and 11 features for the proposed method, PCA, SFS and DBFE, respectively.]

V. DISCUSSION
It is apparent from the results that the spectral grouping and
averaging in the proposed method are capable of representing
the data and capturing the essential variation within the spectral
curve. When the mean of the data is subtracted, the general
form of the spectral curves is removed and only the signal
variance is used to determine the locations of the breakpoints
for the constant function intervals. In this case, the results are
generally the same, but with a little increase in misclassification error. The reason for the decrease in performance is
probably that the variance dominates the process of selecting
the breakpoints anyway, and that the general form of the
spectral curve helps choose more general spectral curve-fitting
intervals. The increase in classification stability aids the latter
explanation. Consequently, in an actual application of the
proposed method, the removal of the mean is generally not
to be recommended.
The TD-GLDB, with its pairwise classification framework,
gives some classes improved discrimination, even though the
overall misclassification rate is significantly higher. Also, it
might be that a more advanced voting scheme would yield a
slightly improved classification rate when using TD-GLDB.
However, the (C − 1)C/2 increase in time complexity for a C-class
problem using its pairwise strategy (e.g., 78 pairwise classifiers
for the 13-class KSC set) renders the method
unfit for fast feature reduction when the classes are numerous.
The low mean number of selected features using TD-GLDB
as seen in Table II can be explained by the generally lower
complexity of two-class problems.
The simple spectral averaging in the proposed method
lets the new features have a direct interpretation as opposed
to the linear feature combinations resulting from PCA and
DBFE. This, together with the computational efficiency and
classification strength of the proposed method, makes it a
natural candidate when analyzing hyperspectral images.
The algorithm presented in Section II can easily be extended
to use higher order polynomials, but that would preclude
the simple interpretation, and increase the number, of the
resulting features. Another obvious improvement of the proposed method would be to do a feature selection on the
resulting features. Pilot studies indicate that a slight increase
in performance can be achieved this way, but including it
would conceal the simplicity, and dampen the computational
efficiency, of the proposed approach.
VI. CONCLUSION
This paper has demonstrated that an optimal piecewise
constant function representation of hyperspectral signature
curves in the mean square sense is capable of representing
the data sufficiently to outperform, or match, other feature
reduction methods like PCA, SFS, DBFE and TD-GLDB for
classification purposes. The simple averaging of spectral bands
makes the resulting features directly interpretable in a physical
sense. Using an efficient dynamic programming algorithm, the
proposed method can be considered fast, making it a natural
candidate when analyzing hyperspectral images.
ACKNOWLEDGEMENTS
We are grateful to the CESBIO institute in Toulouse, France
for providing the ROSIS (Fontainebleau) dataset.
REFERENCES
[1] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical
learning: data mining, inference and prediction. Springer, 2001.
[2] S. B. Serpico, M. D’Inca, F. Melgani, and G. Moser, “Comparison of
feature reduction techniques for classification of hyperspectral remote
sensing data,” in Proc. SPIE, Image and Signal Processing for Remote
Sensing VIII, S. B. Serpico, Ed., vol. 4885, 2003, pp. 347–358.
[3] S. B. Serpico and L. Bruzzone, “A new search algorithm for feature
selection in hyperspectral remote sensing images,” IEEE Trans. Geosci.
Remote Sensing, vol. 39, no. 7, pp. 1360–1367, July 2001.
[4] C. Lee and D. A. Landgrebe, “Feature extraction based on decision
boundaries,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 4, pp.
388–400, 1993.
[5] L. M. Bruce, C. H. Koger, and J. Li, “Dimensionality reduction of
hyperspectral data using discrete wavelet transform feature extraction,”
IEEE Trans. Geosci. Remote Sensing, vol. 40, no. 10, pp. 2331–2338,
October 2002.
[6] H. Schulerud and F. Albregtsen, “Many are called, but few are chosen.
Feature selection and error estimation in high dimensional spaces,”
Computer Methods and Programs in Biomedicine, vol. 73, no. 2, pp.
91–99, 2004.
[7] S. Kumar, J. Ghosh, and M. Crawford, “Best-bases feature extraction
algorithms for classification of hyperspectral data,” IEEE Trans. Geosci.
Remote Sensing, vol. 39, no. 7, pp. 1368–1379, July 2001.
[8] S. B. Serpico, M. D’Inca, and G. Moser, “Design of spectral channels
for hyperspectral image classification,” in Proceedings of the 2004
International Geoscience and Remote Sensing Symposium (IGARSS ’04),
vol. 2, 2004, pp. 956–959.
[9] A. Henneguelle, J. Ghosh, and M. M. Crawford, “Polyline feature
extraction for land cover classification using hyperspectral data.” in
Proceedings of the 1st Indian International Conference on Artificial
Intelligence, 2003, pp. 256–269.
[10] G. Winkler and V. Liebscher, “Smoothers for discontinuous signals,”
Journal of Nonparametric Statistics, vol. 14, pp. 203–222, 2002.
[11] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote
Sensing. Wiley-Interscience, 2003.
[12] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, “Investigation of the
random forest framework for classification of hyperspectral data,” IEEE
Trans. Geosci. Remote Sensing, vol. 43, no. 3, pp. 492–501, March 2005.
[13] J. Hoffbeck and D. Landgrebe, “Covariance matrix estimation and
classification with limited training data,” IEEE Trans. Pattern Anal.
Machine Intell., vol. 18, pp. 763–767, July 1996.
[14] P. Paclik, S. Verzakov, and R. P. W. Duin, Hypertools 2.0: The toolbox for
spectral image analysis, 2005.