Fast Hyperspectral Feature Reduction Using Piecewise Constant Function Approximations

Are C. Jensen, Student Member, IEEE, and Anne Schistad Solberg, Member, IEEE

Abstract— The high number of spectral bands obtained from hyperspectral sensors, combined with the often limited ground truth, calls for some kind of feature reduction when attempting supervised classification. This paper demonstrates that a constant function representation of hyperspectral signature curves, optimal in the mean square sense, is capable of representing the data sufficiently well to outperform, or match, other feature reduction methods such as PCA, SFS and DBFE for classification purposes on all four of the hyperspectral datasets that we have tested. The simple averaging of spectral bands makes the resulting features directly interpretable in a physical sense. Using an efficient dynamic programming algorithm, the proposed method can be considered fast.

Index Terms— Pattern classification, Remote sensing, Feature extraction

This work was supported by the Norwegian Research Council. A. C. Jensen and A. S. Solberg are with the University of Oslo.

I. INTRODUCTION

Hyperspectral sensors provide a rich source of information that allows an accurate separation of land cover classes. Often several hundred spectral samples are acquired for every pixel. Unfortunately, the number of pixels available for training the classifiers is often severely limited, and in combination with the high number of spectral bands, the occurrence of the so-called Hughes phenomenon is almost inevitable. Furthermore, the spectral samples often exhibit high correlation, adding a redundancy that may obscure the information important for classification. To alleviate these problems, one can reduce the dimensionality by feature reduction, or try to regularize parameters by biasing them towards simpler and more stable estimates. This paper focuses on dimensionality reduction.

Classical techniques for feature reduction in the pattern recognition literature can be applied by considering the spectral samples as features. This includes feature selection algorithms like sequential forward selection (SFS), forward-backward selection, floating search and the more recently proposed fast constrained search algorithm [1][2][3], and linear transformations like the principal components transform (PCA), Fisher's canonical transform, decision boundary feature extraction (DBFE) and the discrete wavelet transform [1][4][5]. The linear transforms have a disadvantage compared with the selection methods in that direct interpretation of the resulting features is difficult, i.e., the new features are linear combinations of all the spectral bands. On the other hand, selecting features when the number of possible feature candidates is large and the number of training samples is limited can lead to incorrect and unstable selections [6].

The method presented in this paper divides the spectral curves into contiguous regions by piecewise constant function approximations. The extracted constants are then used as new features. The assumption is that the continuous curves that the features are samples of can be approximated sufficiently well by such functions to allow subsequent classification. By using piecewise constants instead of higher order polynomials, the number of resulting features is minimized, and the features are simple averages of contiguous spectral bands, allowing straightforward interpretation.
This gives a compromise between the generality of linear transforms and the interpretability of the feature selection techniques.

Averages of contiguous spectral bands are also the end result of the top-down generalized local discriminant bases (TD-GLDB) procedure in [7]. That algorithm recursively partitions the spectral curve into two sets of bands and replaces each final set of bands by its mean value. The criterion for selecting the partitioning breakpoint is either the performance on a validation set or an estimate of the differences in class probability densities. The same paper also proposes a more computationally intensive strategy where the simple averaging used to create features is replaced by Fisher's canonical transform, combined with bottom-up search strategies. Both methods rely on a pairwise classification framework. Another approach based on contiguous band averages is that of [8], which uses adaptations of selection methods to extract non-overlapping spectral regions. Instead of iteratively optimizing a classification criterion as in the latter approach, the proposed method focuses solely on minimizing the representation error when finding the breakpoints between the regions to average. Replacing the piecewise constants by linear functions and minimizing the representational error for the class mean curves using a greedy recursive method has been applied in [9]. In Section II it is shown that this method, with the addition of finding globally optimal breakpoints, can be seen as a special case of the proposed optimization framework when considering piecewise linear instead of piecewise constant segments.

This paper shows that the fairly simple and direct approach proposed here outperforms, or matches, methods like SFS, PCA, DBFE and TD-GLDB on all four of the hyperspectral data sets that we have tested.

II. OPTIMAL PIECEWISE CONSTANT REPRESENTATION

The goal is to partition the hyperspectral signatures into a fixed number of contiguous intervals with constant intensities, minimizing the mean square representation error. Let S_ij be the jth feature in the ith pixel of a dataset with a total of N pixels with M features. S = {S_ij | 1 ≤ i ≤ N, 1 ≤ j ≤ M} is thus the collection of all spectral curves (the entire training dataset) available, regardless of class membership. We seek a set of K breakpoints, P = {p_1, p_2, ..., p_K}, which define the contiguous intervals I_k = [p_k, p_{k+1}). Note that the K breakpoints are equal for all the classes. In our model each interval, for every pixel, is represented by a constant, µ_ik ∈ ℝ. The square representation error of the model is thus

    H = \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{j \in I_k} (S_{ij} - \mu_{ik})^2.    (1)

If the breakpoints, P, are given, one ends up with a simple quadratic function w.r.t. the constants µ_ik, so the minimizer of (1) w.r.t. µ_ik is given by, letting |I_k| denote the number of elements in the kth interval,

    \mu_{ik} = \frac{1}{|I_k|} \sum_{j \in I_k} S_{ij},    (2)

i.e., the mean value of each pixel's interval between breakpoints.
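To make (1) and (2) concrete, the following is a minimal NumPy sketch (ours, not the authors' code; all function names are hypothetical) that computes the per-pixel segment means and the resulting square error for a given set of breakpoints. Bands are indexed from zero, and the K breakpoints are taken as interior band indices, so they delimit K+1 segments, matching Fig. 1.

```python
import numpy as np

def segment_means(S, breakpoints):
    """Eq. (2): per-pixel means over the contiguous band intervals.
    S is an (N, M) array of spectral curves; breakpoints is a sorted list
    of K interior band indices, giving the K+1 intervals
    [0, p1), [p1, p2), ..., [pK, M)."""
    edges = [0] + list(breakpoints) + [S.shape[1]]
    return np.column_stack([S[:, a:b].mean(axis=1)
                            for a, b in zip(edges[:-1], edges[1:])])

def representation_error(S, breakpoints):
    """Eq. (1): square error of the piecewise constant model."""
    mu = segment_means(S, breakpoints)
    edges = [0] + list(breakpoints) + [S.shape[1]]
    return sum(((S[:, a:b] - mu[:, [k]]) ** 2).sum()
               for k, (a, b) in enumerate(zip(edges[:-1], edges[1:])))
```

The means returned by segment_means are exactly the features used for classification later; representation_error is the quantity that the breakpoint search below minimizes.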
What is left to be determined in order to minimize (1) is thus the locations of the breakpoints. These breakpoints can be found using dynamic programming. The algorithm presented here is based on [10], with the extension of allowing multiple spectral curves. A more detailed exposition of the dynamic programming can be found in the mentioned reference. The following gives the minimum of (1) for all K. Define

    B(K, m) = \min_{P, \theta, |P| = K} H_m    (3)

where H_m means (1) with only the first m features used, |P| = K means the use of K breakpoints, and θ is the collection of constants. For M ≥ m ≥ K > 1 we have, since (3) is decreasing with increasing K, the recurrence relation

    B(K, m) = \min_{K-1 \le r \le m-1} \left( B(K-1, r) + D_{[r+1, m]} \right)    (4)

where

    D_{[r, m]} = \sum_{i=1}^{N} \min_{\mu} \sum_{j=r}^{m} (S_{ij} - \mu)^2.    (5)

In short, we end up with Algorithm 1.

Algorithm 1 Finding the minimum of (1) for all K.
1. Set B(1, m) = D_{[1, m]}, 1 ≤ m ≤ M.
2. Determine B(K, m), 1 < K ≤ m ≤ M, from (4), keeping track of the minimizer r*_{K,m}.
3. Determine the breakpoints recursively from the minimizers r*: for B(K, M), B(K-1, r*_{K,M}), ...

After finding the breakpoints, the constants µ_ik from (2) are applied as features in the classification process. An example of a piecewise constant function approximation with K = 20 is shown in Figure 1. Although the complexity of the algorithm is O(NM^3), N being the number of pixels and M the number of spectral bands, the inner loop of the algorithm is fast. Examples of execution times can be found in Section IV.

The procedure can be perceived as finding an orthogonal linear transform minimizing the square representational error, with the additional constraint that each row in the resulting matrix represents the mean of a single curve segment. Subsequently, the extracted features are thus obtained by simple matrix multiplication.

Fig. 1. Example of piecewise constant function approximation from the KSC dataset. The number of segments (K+1) is 21.

A. Possible extensions of the approach

To balance possibly unequal numbers of samples in the different classes, and to allow a priori class weighting, (1) can be rewritten

    H = \sum_{k=1}^{K} \sum_{i=1}^{N} \frac{p(i)}{n(i)} \sum_{j \in I_k} (S_{ij} - \mu_{ik})^2    (6)

where p(i) and n(i) are the a priori likelihood and the number of pixels of the class of the ith pixel, respectively. After a similar alteration of (5), Algorithm 1 is applicable directly.

Replacing the simple piecewise constant function model in (1) with higher order polynomials is straightforward, but by using, e.g., a linear model, the number of features for a given set of breakpoints is doubled. This increase in features would most often require subsequent feature selection when applied in a classification setting. However, the method still produces a linear transform, although not an orthogonal one.

B. Summarizing the approach

To summarize the approach, training consists of determining the K breakpoints and the corresponding segment means minimizing the square representation error. The extracted parameters are then used in a regular classifier, in our case a Gaussian Maximum Likelihood classifier.
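The dynamic program itself can be sketched as follows. Again, this is our own illustration, not the authors' (non-optimized Matlab) implementation; in particular, precomputing the interval costs D_{[r,m]} of (5) with per-pixel cumulative sums is our assumption about one efficient realization. Indices are zero-based, unlike the one-based notation above.

```python
import numpy as np

def cost_matrix(S):
    """D[r, m]: cost of representing bands r..m (inclusive, 0-indexed) of
    every pixel by one constant each, summed over pixels -- eq. (5)."""
    N, M = S.shape
    cs = np.concatenate([np.zeros((N, 1)), np.cumsum(S, axis=1)], axis=1)
    cs2 = np.concatenate([np.zeros((N, 1)), np.cumsum(S**2, axis=1)], axis=1)
    D = np.zeros((M, M))
    for r in range(M):
        for m in range(r, M):
            length = m - r + 1
            seg_sum = cs[:, m + 1] - cs[:, r]    # per-pixel sums over the window
            seg_sq = cs2[:, m + 1] - cs2[:, r]   # per-pixel sums of squares
            # min_mu sum_j (S_ij - mu)^2 = sum_j S_ij^2 - (sum_j S_ij)^2 / length
            D[r, m] = float(np.sum(seg_sq - seg_sum**2 / length))
    return D

def optimal_breakpoints(S, n_segments):
    """Algorithm 1: B[k, m] is the minimal error of covering bands 0..m
    with k constant segments; arg[] records the minimizers for backtracking."""
    M = S.shape[1]
    D = cost_matrix(S)
    B = np.full((n_segments + 1, M), np.inf)
    arg = np.zeros((n_segments + 1, M), dtype=int)
    B[1, :] = D[0, :]                            # step 1: one segment over bands 0..m
    for k in range(2, n_segments + 1):           # step 2: recurrence (4)
        for m in range(k - 1, M):
            rs = np.arange(k - 2, m)             # last segment covers bands r+1..m
            cand = B[k - 1, rs] + D[rs + 1, m]
            best = int(np.argmin(cand))
            B[k, m], arg[k, m] = cand[best], rs[best]
    bps, m = [], M - 1                           # step 3: backtrack the breakpoints
    for k in range(n_segments, 1, -1):
        m = arg[k, m]
        bps.append(m + 1)
    return sorted(bps), B[n_segments, M - 1]
```

Passing the returned breakpoints to segment_means above yields the reduced features; for instance, n_segments = 21 corresponds to the K = 20 approximation of Fig. 1. Note that B and arg also hold the minima and minimizers for every smaller segment count, so a single run provides all candidate feature set sizes, which is the setting timed in Table III.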
III. EXPERIMENTS

A. The Datasets

To give a thorough analysis of the performance of the proposed method, it has been applied to four hyperspectral datasets containing widely different types of data, with dimensions ranging from 81 to 176 bands. The first dataset, ROSIS, is from an airborne sensor, contains forest type data, is divided into three classes, has 81 spectral bands, and has a pixel size of 5.6 m. The second dataset, DC [11], is from an airborne sensor, contains urban type data, is divided into five classes, and has 150 bands. The third set, KSC [12], is from an airborne sensor (AVIRIS), contains vegetation type data, is divided into 13 classes, has 176 spectral bands, and has 18 m pixels. The last dataset, BOTSWANA [12], is from a scene taken by the Hyperion sensor aboard the EO-1 satellite, contains vegetation type data, is divided into 14 classes, has 145 bands, and has a 30 m pixel size. The average number of training pixels per class is 700, 600, 196 and 115 for the respective data sets.

B. Methods

To evaluate the performance of the different methods, we applied the standard approach of separating the available ground-truthed data into roughly equally sized regions for training and testing, and we report performance on the test data. As far as possible, we have made sure that all regions for training are spatially disjoint from the regions for testing, to avoid training on neighboring pixels that are correlated with test data. For the two datasets with sufficient numbers of pixels, 5 repeated experiments were designed by randomly sampling equally sized sets for each class from the half of the available data set aside for training the classifier.

The proposed method is compared with the principal components transform (PCA), sequential forward selection (SFS) using the sum of the estimated Mahalanobis distances as the feature evaluation function, decision boundary feature extraction (DBFE) using the leave-one-out covariance matrix estimate [13], and top-down generalized local discriminant bases (TD-GLDB) [7] using their suggested log-odds probabilities criterion. Simple majority voting was used in the TD-GLDB pairwise framework, and results using the implementation in [14] are reported. The PCA is fast, widely used, and based on signal representation error, while the DBFE is more elaborate, including class separability in the optimization process. The TD-GLDB is based on averaging of adjacent bands and thus has similarities to the proposed method, even though it requires a pairwise classification setting. More sophisticated methods for feature selection, e.g., forward-backward selection and floating selection [1], have been omitted because of their excessive execution times, which make a comparison with the other, faster methods inappropriate.

The extracted features are used in a Gaussian Maximum Likelihood (GML) classifier. The number of features, e.g., the number of principal components in PCA or the number of contiguous intervals in the proposed method, was chosen using standard 10-fold cross-validation on each of the sampled training sets, as sketched below. The average performance and stability reported are based on the test data in these 5 experiments. Due to lack of knowledge about the true priors of the ground truth classes, the average performance over all classes is reported in the experiments. This is the case for all the data sets.

Experiments using the proposed method on data where the mean value for each feature has been set to zero have also been conducted. This was done to separate the general structure of the spectral curve from the variance found in the data.
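The following sketch illustrates the model selection step referenced above, under our own assumptions: scikit-learn's QuadraticDiscriminantAnalysis serves as a stand-in for the Gaussian Maximum Likelihood classifier, and the function name and candidate range are hypothetical. It reuses segment_means and optimal_breakpoints from Section II.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def select_n_segments(S_train, y_train, candidates=range(2, 41)):
    """Pick the number of constant segments by 10-fold cross-validation
    of a Gaussian classifier on the segment-mean features."""
    best_k, best_acc = None, -np.inf
    for k in candidates:
        # Breakpoints are fit on the spectra alone (eq. (1) ignores labels);
        # a single DP run could serve all k, the loop just keeps it simple.
        bps, _ = optimal_breakpoints(S_train, k)
        X = segment_means(S_train, bps)          # eq. (2) features
        clf = QuadraticDiscriminantAnalysis()    # stand-in for GML
        acc = cross_val_score(clf, X, y_train, cv=10).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```

Since the breakpoint search uses only the spectra and no class labels, fitting it once outside the folds introduces no label leakage; only the classifier is cross-validated.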
IV. RESULTS

Table I shows that the overall misclassification rate, when the optimal number of features is chosen, is lower using the proposed method on three out of four datasets. The sequential forward selection (SFS) method has a slightly better performance on the DC dataset, but the proposed method still outperforms the PCA, DBFE and TD-GLDB approaches. Except for the TD-GLDB, the misclassification rates of the methods can, in a natural way, be visualized as functions of the number of features used. Figures 2 and 3 are representative examples showing the performance as such functions.

TABLE I
TEST RESULTS WITH THE OPTIMAL NUMBER OF FEATURES CHOSEN BY CROSS-VALIDATION FOR THE FOUR DATASETS USING PCA, SFS, DBFE, TD-GLDB AND THE PROPOSED METHOD. MEAN MISCLASSIFICATION RATE AND ONE STANDARD DEVIATION IN PERCENT.

               ROSIS          DC            KSC      BOTSWANA
PCA            14.79 ±0.25    3.20 ±0.21    12.22    4.45
SFS            14.93 ±0.30    2.85 ±0.14    12.49    5.28
DBFE           16.44 ±0.65    3.89 ±0.20    13.33    6.20
TD-GLDB        20.28 ±0.41    6.21 ±0.42    17.18    7.41
Proposed       14.18 ±0.17    2.96 ±0.05    10.16    3.74
Proposed(µ=0)  14.90 ±0.18    3.03 ±0.15    10.39    3.29

Fig. 2. Misclassification rate for different numbers of features using PCA, SFS and the proposed method on the BOTSWANA dataset. Cross-validation estimated 12, 14, 16, and 20 features for the proposed method, PCA, SFS and DBFE, respectively.

Fig. 3. Misclassification rate for different numbers of features using PCA, SFS and the proposed method on the KSC dataset. Cross-validation estimated 21, 14, 18, and 11 features for the proposed method, PCA, SFS and DBFE, respectively.

On the BOTSWANA dataset, the proposed method and PCA perform almost equally well, but the proposed method has a slightly sharper drop in misclassification rate in the first part of the graph and a smaller increase in misclassification as the number of features grows beyond the optimal point. The KSC results shown in Figure 3 show a distinctly lower misclassification rate for the proposed method over large intervals of feature numbers.

When the mean curve of the data has been subtracted, the proposed method has a small drop in performance on three out of four datasets, but is still comparable to or better than the other methods. The variances of the misclassification rate, i.e., the instabilities of the classifiers, are lower for the proposed method, with an increase when subtracting the mean.

Even though the estimated placements of the breakpoints in the proposed method are quite stable across the different trials for each dataset, the chosen number of regions varies, as seen in Table II. Note that although the chosen number of regions varies between experiments, the error rate remains fairly stable.

TABLE II
THE MEAN NUMBER OF FEATURES AND ONE STANDARD DEVIATION FOR THE DIFFERENT FEATURE REDUCTION APPROACHES.

               ROSIS         DC           KSC    BOTSWANA
PCA            17.2 ±3.8     73.4 ±27     14     14
SFS            19.6 ±2.9     72.8 ±12     18     16
DBFE           36.2 ±10.6    141 ±5.5     11     20
TD-GLDB^a      6.9 ±0.37     6.8 ±0.67    8.6    8.6
Proposed       20.0 ±1.4     64.8 ±20     21     12
Proposed(µ=0)  21.2 ±1.6     57.8 ±22     19     20

^a The means of the pairwise classifiers are reported.

Algorithm runtimes are shown in Table III. The time it takes to find the possible features using the proposed method is merely a fraction of the time it takes to run the SFSs, and considerably lower than that of the DBFEs, even with the non-optimized Matlab version of the proposed method used in this study. The execution times for the pairwise strategy of TD-GLDB increase drastically when applied to the datasets with many classes. The PCA, of course, only needs to solve an eigenvalue problem, and takes about a second.

TABLE III
EXECUTION TIMES IN SECONDS TO FIND ALL FEATURE SETS RANGING FROM ONE TO FULL DIMENSION.

            ROSIS    DC      KSC     BOTSWANA
SFS         164      1255    2916    1509
DBFE        21       207     420     295
TD-GLDB^a   35       226     2273    2239
Proposed    19       178     239     85

^a TD-GLDB finds the final features directly.

V. DISCUSSION

It is apparent from the results that the spectral grouping and averaging in the proposed method are capable of representing the data and capturing the essential variation within the spectral curve. When the mean of the data is subtracted, the general form of the spectral curves is removed and only the signal variance is used to determine the locations of the breakpoints for the constant function intervals.
In this case, the results are generally the same, but with a small increase in misclassification error. The reason for the decrease in performance is probably that the variance dominates the process of selecting the breakpoints anyway, and that the general form of the spectral curve helps choose more generally applicable curve-fitting intervals. The increase in classification stability supports the latter explanation. Consequently, in an actual application of the proposed method, removing the mean is generally not recommended.

The TD-GLDB, with its pairwise classification framework, gives some classes improved discrimination, even though its overall misclassification rate is significantly higher. Also, it might be that a more advanced voting scheme would yield a slightly improved classification rate when using TD-GLDB. However, the (C−1)C/2 increase in time complexity for a C-class problem using its pairwise strategy (e.g., 91 pairwise classifiers for the 14-class BOTSWANA set) renders the method unfit for fast feature reduction when the classes are numerous. The low mean number of selected features using TD-GLDB, as seen in Table II, can be explained by the generally lower complexity of two-class problems.

The simple spectral averaging in the proposed method gives the new features a direct interpretation, as opposed to the linear feature combinations resulting from PCA and DBFE. This, together with the computational efficiency and classification strength of the proposed method, makes it a natural candidate when analyzing hyperspectral images. The algorithm presented in Section II can easily be extended to use higher order polynomials, but that would preclude the simple interpretation, and increase the number, of the resulting features. Another obvious refinement of the proposed method would be to perform feature selection on the resulting features. Pilot studies indicate that a slight increase in performance can be achieved this way, but including it would conceal the simplicity, and dampen the computational efficiency, of the proposed approach.

VI. CONCLUSION

This paper has demonstrated that an optimal piecewise constant function representation of hyperspectral signature curves in the mean square sense is capable of representing the data sufficiently well to outperform, or match, other feature reduction methods like PCA, SFS, DBFE and TD-GLDB for classification purposes. The simple averaging of spectral bands makes the resulting features directly interpretable in a physical sense. Using an efficient dynamic programming algorithm, the proposed method can be considered fast, making it a natural candidate when analyzing hyperspectral images.

ACKNOWLEDGEMENTS

We are grateful to the CESBIO institute in Toulouse, France for providing the ROSIS (Fontainebleau) dataset.

REFERENCES

[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.
[2] S. B. Serpico, M. D'Inca, F. Melgani, and G. Moser, "Comparison of feature reduction techniques for classification of hyperspectral remote sensing data," in Proc. SPIE, Image and Signal Processing for Remote Sensing VIII, S. B. Serpico, Ed., vol. 4885, 2003, pp. 347–358.
[3] S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sensing, vol. 39, no. 7, pp. 1360–1367, July 2001.
[4] C. Lee and D. A. Landgrebe, "Feature extraction based on decision boundaries," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 4, pp. 388–400, 1993.
[5] L. M. Bruce, C. H. Koger, and J. Li, "Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction," IEEE Trans. Geosci. Remote Sensing, vol. 40, no. 10, pp. 2331–2338, October 2002.
[6] H. Schulerud and F. Albregtsen, "Many are called, but few are chosen. Feature selection and error estimation in high dimensional spaces," Computer Methods and Programs in Biomedicine, vol. 73, no. 2, pp. 91–99, 2004.
[7] S. Kumar, J. Ghosh, and M. Crawford, "Best-bases feature extraction algorithms for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sensing, vol. 39, no. 7, pp. 1368–1379, July 2001.
[8] S. B. Serpico, M. D'Inca, and G. Moser, "Design of spectral channels for hyperspectral image classification," in Proceedings of the 2004 International Geoscience and Remote Sensing Symposium (IGARSS '04), vol. 2, 2004, pp. 956–959.
[9] A. Henneguelle, J. Ghosh, and M. M. Crawford, "Polyline feature extraction for land cover classification using hyperspectral data," in Proceedings of the 1st Indian International Conference on Artificial Intelligence, 2003, pp. 256–269.
[10] G. Winkler and V. Liebscher, "Smoothers for discontinuous signals," Journal of Nonparametric Statistics, vol. 14, pp. 203–222, 2002.
[11] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Wiley-Interscience, 2003.
[12] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, "Investigation of the random forest framework for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sensing, vol. 43, no. 3, pp. 492–501, March 2005.
[13] J. Hoffbeck and D. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 763–767, July 1996.
[14] P. Paclik, S. Verzakov, and R. P. W. Duin, Hypertools 2.0: The toolbox for spectral image analysis, 2005.