Information-Theoretic Feature Selection for Classification: Applications to Fusion of Hyperspectral and LiDAR Data

Qiang Zhang, Victor P. Pauca, Robert J. Plemmons, Robert Rand, and Todd C. Torgersen

Abstract: In an era of massive data gathering, analysts face an increasingly difficult decision regarding how many data features to consider for the purpose of target detection and classification. While more data may offer additional information, at the cost of a potentially prohibitive increase in computation time, it is unclear whether classification accuracy increases accordingly. In this study, we address this question from an information-theoretic standpoint, providing upper and lower bounds on the misclassification error, derived from the Chernoff information (CI) and the resistor-average distance, respectively. Our results are applied to the problem of classification from the fusion of real hyperspectral and LiDAR data, using numerical experiments and analysis. The results show that, in principle, adding extra data features does not decrease the CI, or equivalently the classification accuracy. Empirically, however, this is not the case, and our results show that the indiscriminate inclusion of more data can in fact reduce classification accuracy. We offer a hill-climbing algorithm to incrementally rank and reorder features in such a way that adding an extra feature to any first k features does not lead to information loss or, equivalently, to a reduction in classification accuracy.

Q. Zhang is with the Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, North Carolina. Victor P. Pauca is with the Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina. Robert J. Plemmons is with the Departments of Computer Science and Mathematics, Wake Forest University, Winston-Salem, North Carolina. Robert Rand is with the National Geospatial-Intelligence Agency. Todd C. Torgersen is with the Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina.

Index Terms: Chernoff information, feature selection, high-dimensional data classification, fusion of hyperspectral and LiDAR data, remote sensing.

I. INTRODUCTION

The science of remote sensing for the environment and other applications in its broadest sense includes general aerial images and satellite and spacecraft observations of parts of the earth. Examples of sensing modalities include hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR). HSI remote sensing technology allows one to capture images using a range of spectra from the ultraviolet to the visible to the infrared. Hundreds of images of a scene or object are created using light from different parts of the spectrum. These hyperspectral images can be used, for example, to detect objects at a distance, to identify surface minerals, objects, and buildings from space, and to identify space objects from the ground, leading to useful classification of data features. LiDAR is a remote sensing technology that measures distance by illuminating a target with a laser and analyzing the reflected light. Each LiDAR point has position and color in an (x, y, z, R, G, B) format, along with range information. Its wide applications include geography, archaeology, geology, forestry, remote sensing, and atmospheric physics [1].
A rich literature exists on material identification, clustering, and spectral unmixing of HSI data, including the matched filter and related methods [2], [3], principal component analysis (PCA) [4], nonnegative matrix factorization (NMF) [5], [6], [7], independent component analysis (ICA) [8], [9], support vector machines (SVM) [10], projection pursuit [11], and hierarchical clustering models [12]. In addition, recent advances in sensor development have allowed both HSI and LiDAR data to be captured simultaneously [13], [14], thus providing both geometrical and spectral information for a given scene. A typical dataset collected in this way can contain millions of LiDAR points and hundreds of gigabytes of HSI imagery; see, e.g., [15], [16], [17], [18]. The fusion of multi-modality data, e.g., HSI and LiDAR, is an important, but challenging, aspect of remote sensing. The huge amounts of data provided by conventional and multi-modal sensing platforms present a significant challenge for the development of effective computational techniques for processing, storage, transmission, and analysis of such data.

Facing these large amounts of data, analysts often have to make a first decision on whether to use the entire dataset or a subset for a given classification algorithm. Quite often the dataset becomes so large that it is computationally prohibitive to use the whole dataset, and thus the question becomes which subset to choose. A related question is whether hundreds of HSI images plus millions of LiDAR points would guarantee better identification or classification of objects. The curse of dimensionality is a well-known phenomenon, captured by the following limit [19]:

    \lim_{p \to \infty} \frac{d_{\max} - d_{\min}}{d_{\min}} = 0,    (1)

where p is the number of dimensions, and d_max and d_min are, respectively, the maximum and minimum distances from a query point to a set of sample points. In other words, it becomes harder and harder to differentiate two objects as the data dimension increases. Clearly, more is not necessarily better, but could it be worse?

Using the misclassification error as the criterion for evaluating whether more data would be better or worse for classification, we do not rely on any specific classification algorithm, such as PCA or SVM, but instead establish both an upper bound on the misclassification error, via the Chernoff information [20], and a lower bound, via the resistor-average distance (RAD) [21], a symmetrized Kullback-Leibler distance. This approach avoids empirically comparing different matched filters or clustering algorithms, some of which excel on one dataset but fall behind on others, while the bounds allow us to evaluate which subspaces would be more appropriate for classifying objects. In this study we demonstrate that, theoretically, adding more data will not decrease the classification accuracy, but empirically this is not necessarily true. For a real hyperspectral dataset, we show that incrementally adding bands from shorter to longer wavelengths can at some point reduce the Chernoff information, meaning that adding more data could instead reduce the classification accuracy, as also observed in [22]. This raises a caution on approaches that arbitrarily add extra dimensions of data, e.g., by concatenating data from different modalities such as HSI and LiDAR.
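To make the effect in (1) concrete, the following minimal Python sketch (our illustration, not part of the original experiments) draws points uniformly in the unit cube of increasing dimension and tracks the relative gap between the farthest and nearest neighbors of a random query point; the sample size and the uniform distribution are arbitrary choices made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def relative_gap(p, n=2000):
        """Relative gap (d_max - d_min) / d_min between the farthest and nearest
        of n points, drawn uniformly in [0, 1]^p, from a random query point."""
        points = rng.uniform(size=(n, p))
        query = rng.uniform(size=p)
        d = np.linalg.norm(points - query, axis=1)
        return (d.max() - d.min()) / d.min()

    for p in [1, 10, 100, 1000]:
        print(p, relative_gap(p))   # the gap shrinks toward 0 as p grows

As the dimension p grows, the printed ratio collapses toward zero, i.e., the nearest and farthest neighbors become nearly indistinguishable in distance.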
On the other hand, the Chernoff information may also provide an understanding of the most information-rich dimensions or bands, e.g., a subset of wavelengths that would best differentiate one material from others.

In the literature on variable selection for high-dimensional clustering, or equivalently band selection in hyperspectral data analysis, various importance measures have been proposed, e.g., density-based methods such as the excess-mass test for multimodality [23], [24], [25], criterion-based methods such as sparse k-means [26] using the Between-Cluster Sum of Squares (BCSS), and model-based methods such as the Bayesian Information Criterion [27]. Band selection is also an important research topic in hyperspectral imaging, where we see information-theoretic approaches such as [28], [29]; for more comprehensive comparisons, see, e.g., [30]. In the following discussion, by a feature we mean a variable at each spatial pixel, either a spectral band or a LiDAR variable.

The differences between our approach and variable-selection, also known as band-selection, methods in the hyperspectral literature lie in several aspects. First, rather than evaluating each feature separately, such as by the information entropy and the contrast measure [31], or in a small neighborhood of features, such as by the first and second spectral derivatives [32], the spectral ratio measure [33], and the correlation measure [34], the upper and lower bounds on the misclassification error are computed using a subset of features. A hill-climbing algorithm using the upper bound ranks and adds features incrementally from the most information-rich to the least, and each new feature's information is evaluated conditional upon the set of already chosen features. Second, the information bounds are independent of any particular classification algorithm or dataset, while guaranteeing that no classification algorithm can achieve an error outside these bounds. Our proposed method is much easier to compute and can be applied not only to hyperspectral data but also to LiDAR, while some band-selection methods are devised specifically for hyperspectral data. For example, the spectral-derivative methods depend on the smoothness of neighboring-band variations, which may not hold for LiDAR data that often show abrupt changes. Third, unlike fixed measures such as the information entropy or the spectral derivatives, the ranks of the features can vary depending upon the underlying statistical models assumed by classifiers. This third aspect deserves particular attention: rather than feeding a chosen subset of data to a classifier without knowing its underlying statistical assumptions, analysts should understand those assumptions well enough that information-theoretic bounds can be established to validate the empirical results, which can vary significantly from one dataset to another.

The structure of the paper is as follows. In Sec. II, we introduce the definition of Chernoff information and its fast computation, specifically for the exponential family of distributions. Some empirical results are presented to motivate further developments. In Secs. III and IV, we discuss the feature selection method based on Chernoff information that employs all the bands while guaranteeing a nondecreasing Chernoff information sequence. In Sec. V, the method is illustrated through the problem of fusing real HSI and LiDAR datasets. We conclude in Sec. VI with a discussion.
II. CHERNOFF INFORMATION

The expected cost of misclassification (E), also called the Bayesian error (risk), is formulated as

    E = \sum_i p_i \sum_{k \ne i} P(k|i),    (2)

where p_i is the prior probability of class (object/target) i and P(k|i) is the probability of classifying class i as class k. E is minimized if an observed vector x in population k satisfies

    p_k f_k(x) > p_i f_i(x), \quad i \ne k,    (3)

where f_i(x) is the distribution of class i. If we assume a uniform prior distribution, then the classification rule can be stated as: find k such that

    f_k(x) > f_i(x), \quad i \ne k.    (4)

Consider a univariate two-class case in which each class has a normal distribution, as shown in Fig. 1, where f_i, i = 1, 2, is a normal distribution with mean \mu_i, and the decision region is (-\infty, \frac{\mu_1+\mu_2}{2}) for class 1 and (\frac{\mu_1+\mu_2}{2}, +\infty) for class 2. Moving the cut point to the left or right of (\mu_1+\mu_2)/2 increases E, shown as the shaded area in Fig. 1.

[Fig. 1. Classification error of a univariate two-class case.]

To establish an upper bound on the classification error, we use the two-class case as an example, with

    E = \int_x \min(p_1 f_1, p_2 f_2)\, dx.    (5)

Using the relationship

    \min(a, b) \le a^{\alpha} b^{1-\alpha}, \quad \forall \alpha \in [0, 1],    (6)

we obtain a first bound on E as

    E \le p_1^{\alpha} p_2^{1-\alpha} \int_x f_1^{\alpha}(x) f_2^{1-\alpha}(x)\, dx.    (7)

Define the Chernoff \alpha-divergence as

    C_{\alpha}(f_1 : f_2) = -\log \int_x f_1^{\alpha}(x) f_2^{1-\alpha}(x)\, dx,    (8)

and the Chernoff information as

    C^*(f_1 : f_2) = \max_{\alpha \in [0,1]} C_{\alpha}(f_1 : f_2).    (9)

Then we have

    E \le p_1^{\alpha^*} p_2^{1-\alpha^*} \exp\left[-C^*(f_1 : f_2)\right],    (10)

and for the uniform prior case we simply have

    E \le \exp\left[-C^*(f_1 : f_2)\right].    (11)

Clearly, a higher C^* means a lower misclassification error and hence is more desirable. From the definitions of the Chernoff \alpha-divergence and the Chernoff information, we know they are both nonnegative, and the \alpha-divergence is asymmetric, i.e., C_{\alpha}(f_1 : f_2) \ne C_{\alpha}(f_2 : f_1), but C_{\alpha}(f_2 : f_1) = C_{1-\alpha}(f_1 : f_2). Applications of Chernoff information include, e.g., sensor networks [35], image segmentation and registration [36], [37], face recognition [38], and edge detection [39].

Next, we look at some basic properties of the Chernoff information. For multivariate distributions, if one variable is independent of the rest, the Chernoff \alpha-divergence, and hence the Chernoff information, of all variables is decomposable. This is stated in the following proposition.

Proposition 1. Let f_1 and f_2 be two joint distributions of a d-variate random vector x = (x_1, \ldots, x_d) and a random variable y. If y is independent of x, then

    C_{\alpha}(f_1, f_2) = C_{\alpha}(f_{1x}, f_{2x}) + C_{\alpha}(f_{1y}, f_{2y}),    (12)

where f_{1x} and f_{2x} are the marginal distributions of x, and f_{1y} and f_{2y} are the marginal distributions of y.

The proof follows directly from the factorization of the distribution functions f_1 and f_2 under the independence assumption, and hence is omitted here. Proposition 1 tells us that adding independent variables for classification never reduces the Chernoff information, i.e., never increases the upper bound on the misclassification error. If the variable y is not independent of x, the Chernoff \alpha-divergence between f_1 and f_2 is still no less than that between f_{1x} and f_{2x}.

Proposition 2. If y is not independent of x, then

    C_{\alpha}(f_1, f_2) \ge C_{\alpha}(f_{1x}, f_{2x}).    (13)

Proof: By the conditional factorization f_i(x, y) = f_i(y|x) f_{ix}(x), i = 1, 2, we have

    \int_{\Omega_x} \int_{\Omega_y} f_1^{\alpha}(x, y) f_2^{1-\alpha}(x, y)\, dy\, dx
      = \int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x) \int_{\Omega_y} f_1^{\alpha}(y|x) f_2^{1-\alpha}(y|x)\, dy\, dx.    (14)

Denote g_{\alpha}(x) = \int_{\Omega_y} f_1^{\alpha}(y|x) f_2^{1-\alpha}(y|x)\, dy, and notice that 0 < g_{\alpha}(x) \le 1 for all x and all \alpha.
Hence, the integral above satisfies

    \int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x)\, g_{\alpha}(x)\, dx \le \int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x)\, dx.    (15)

Hence, C_{\alpha}(f_1, f_2) \ge C_{\alpha}(f_{1x}, f_{2x}) by definition.

Propositions 1 and 2 confirm that adding a variable, whether independent or not, should not reduce the Chernoff information, i.e., should not increase the upper bound on the misclassification error. There is generally no closed form for the Chernoff information, except in special cases such as the one-parameter exponential family. For example, the Chernoff information between two univariate Gaussian distributions with a common standard deviation is

    C^*(f_1 : f_2) = \frac{(\mu_1 - \mu_2)^2}{8 \sigma^2}.    (16)

However, if we assume the class distributions belong to the exponential family, the relationships among the Chernoff \alpha-divergence, the Jensen \alpha-divergence, and the Bregman divergence lead to a simple algorithm for computing the CI [20]. The exponential family of distributions can be written as

    p(x) = h(x)\, e^{\langle \theta, f(x) \rangle - F(\theta)},    (17)

where \theta is the natural parameter vector, f(x) is the sufficient statistics vector, F(\theta) is the cumulant generating function, and h is simply a normalizing function. For example, the multivariate normal distribution N(\mu, \Sigma) has \theta = (\Sigma^{-1}\mu, -\frac{1}{2}\Sigma^{-1}), f(x) = (x, x x^T), and F(\theta) = \frac{1}{2}\mu^T \Sigma^{-1} \mu + \frac{1}{2} \log \det(2\pi\Sigma). The Jensen \alpha-divergence on the natural parameters is defined as

    J_{\alpha}(\theta_1, \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2),    (18)

and the Bregman divergence on the natural parameters is defined as

    B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle,    (19)

where F is the cumulant generating function in the definition of the exponential family. For the exponential family of distributions, we have the following relationships:

• the Chernoff \alpha-divergence between members of the same exponential family is equivalent to a Jensen \alpha-divergence on the corresponding natural parameters;
• the maximal Jensen \alpha-divergence can be computed as an equivalent Bregman divergence;
• the Bregman divergence equals a Kullback-Leibler divergence with the arguments swapped; the former acts on the natural parameters, while the latter acts on the distributions themselves.

These relationships allow us to avoid the integration and optimization steps in the computation of the Chernoff information and instead use a simple geodesic-bisection algorithm to find the optimal \alpha^*. The algorithm is summarized in pseudo-code in Algorithm 1.

Input: natural parameters \theta_1 and \theta_2 of two distributions belonging to the same exponential family.
Output: \alpha^* and \theta^* = (1 - \alpha^*)\theta_1 + \alpha^*\theta_2.
1   Initialize \alpha_m = 0, \alpha_M = 1.
2   while \alpha_M - \alpha_m > tol do
3       Compute the midpoint \alpha_0 = (\alpha_m + \alpha_M)/2 and \theta = (1 - \alpha_0)\theta_1 + \alpha_0\theta_2.
4       if B(\theta_1 : \theta) < B(\theta_2 : \theta) then
5           \alpha_m = \alpha_0;
6       else
7           \alpha_M = \alpha_0;
8       end
9   end
Algorithm 1: A geodesic-bisection algorithm for computing the Chernoff information.

For a sample of size n, the Chernoff information governs the asymptotic exponent of the misclassification error [40], i.e.,

    E \doteq e^{-n C^*(f_1, f_2)},    (20)

where \doteq denotes equality to first order in the exponent. The error also admits an arbitrarily tight lower bound by an information-theoretic distance [41],

    E \doteq e^{-n C^*(f_1, f_2)} \ge e^{-n R(f_1, f_2)},    (21)

where R(f_1, f_2) is the resistor-average distance (RAD) [21] defined as

    R(f_1, f_2) = \left( KL(f_1, f_2)^{-1} + KL(f_2, f_1)^{-1} \right)^{-1},    (22)

and KL(f_1, f_2) is the Kullback-Leibler (KL) distance from f_1 to f_2. As stated before, for exponential-family distributions the KL distance can be computed as a Bregman divergence on the natural parameters, from which we can derive RADs between pairs of distributions and hence a lower bound on the misclassification error.
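To make Algorithm 1 concrete, the following is a minimal Python sketch specialized to multivariate normal distributions; the function and variable names are ours, not from the original text. It exploits the fact noted above that, for exponential families, B(\theta_i : \theta) equals the KL divergence from the distribution with parameter \theta to the one with parameter \theta_i, so the closed-form Gaussian KL divergence can stand in for the generic Bregman divergence. A helper for the resistor-average distance of (22) is also included.

    import numpy as np

    def kl_gauss(mu_a, cov_a, mu_b, cov_b):
        """KL(N_a || N_b) between two multivariate normals (closed form)."""
        d = mu_a.size
        cov_b_inv = np.linalg.inv(cov_b)
        diff = mu_b - mu_a
        _, logdet_a = np.linalg.slogdet(cov_a)
        _, logdet_b = np.linalg.slogdet(cov_b)
        return 0.5 * (np.trace(cov_b_inv @ cov_a) + diff @ cov_b_inv @ diff
                      - d + logdet_b - logdet_a)

    def chernoff_info_gauss(mu1, cov1, mu2, cov2, tol=1e-6):
        """Algorithm 1 (geodesic bisection) for two Gaussians.

        Walks along the geodesic theta = (1-a)*theta1 + a*theta2 in natural-parameter
        space until KL(p_theta || p_1) == KL(p_theta || p_2); the common value at the
        balance point is the Chernoff information C*."""
        lam1, lam2 = np.linalg.inv(cov1), np.linalg.inv(cov2)     # precision matrices
        a_lo, a_hi = 0.0, 1.0
        while a_hi - a_lo > tol:
            a = 0.5 * (a_lo + a_hi)
            lam = (1 - a) * lam1 + a * lam2                        # interpolated precision
            cov = np.linalg.inv(lam)
            mu = cov @ ((1 - a) * lam1 @ mu1 + a * lam2 @ mu2)     # interpolated mean
            b1 = kl_gauss(mu, cov, mu1, cov1)                      # = B(theta1 : theta)
            b2 = kl_gauss(mu, cov, mu2, cov2)                      # = B(theta2 : theta)
            if b1 < b2:
                a_lo = a
            else:
                a_hi = a
        return 0.5 * (b1 + b2)                                     # C*(f1 : f2)

    def resistor_average(mu1, cov1, mu2, cov2):
        """Resistor-average distance of Eq. (22) for two Gaussians."""
        kl12 = kl_gauss(mu1, cov1, mu2, cov2)
        kl21 = kl_gauss(mu2, cov2, mu1, cov1)
        return 1.0 / (1.0 / kl12 + 1.0 / kl21)

As a quick sanity check, for two univariate Gaussians with a common variance this bisection reproduces the closed form (16): the balance point is \alpha^* = 1/2 and the returned value is (\mu_1 - \mu_2)^2 / (8\sigma^2).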
III. CHERNOFF INFORMATION AND HIGH-DIMENSIONAL CLUSTERING

The multivariate normal (MVN) distribution is the most popular distribution assumed in clustering HSI and LiDAR data, e.g., by the matched filter [2], [3], the orthogonal subspace projection [42], the adaptive matched filter (AMF) [43], and the adaptive cosine estimator (ACE) [44]. Hence, in this section we focus on this particular member of the exponential family, even though Algorithm 1 can be applied to any member of the family. Let f_1 \sim N(\mu_1, \Sigma_1) and f_2 \sim N(\mu_2, \Sigma_2) represent two d-variate normal distributions. Then the Chernoff \alpha-divergence between these two distributions is

    C_{\alpha}(f_1, f_2) = \frac{1}{2} \log \frac{|\alpha\Sigma_1 + (1-\alpha)\Sigma_2|}{|\Sigma_1|^{\alpha} |\Sigma_2|^{1-\alpha}}
      + \frac{\alpha(1-\alpha)}{2} (\mu_1 - \mu_2)^T \left[\alpha\Sigma_1 + (1-\alpha)\Sigma_2\right]^{-1} (\mu_1 - \mu_2).    (23)

Now we add an extra dimension to each distribution and write \tilde{p} \sim N(\tilde{\mu}_1, \tilde{\Sigma}_1) and \tilde{q} \sim N(\tilde{\mu}_2, \tilde{\Sigma}_2) for the resulting (d+1)-dimensional normal distributions, where \tilde{\mu}_1 = (\mu_1; m_1), \tilde{\mu}_2 = (\mu_2; m_2), and

    \tilde{\Sigma}_1 = \begin{pmatrix} \Sigma_1 & s_1 \\ s_1^T & \sigma_1^2 \end{pmatrix}, \qquad
    \tilde{\Sigma}_2 = \begin{pmatrix} \Sigma_2 & s_2 \\ s_2^T & \sigma_2^2 \end{pmatrix}.    (24)

Proposition 2 tells us that C_{\alpha}(\tilde{p}, \tilde{q}) \ge C_{\alpha}(p, q), where p = f_1 and q = f_2, and hence C^*(\tilde{p}, \tilde{q}) \ge C^*(p, q).

To validate this empirically, we select a small HSI dataset of size 60 x 100 x 58, where the first two numbers are the sizes of the two spatial dimensions and the last number is the size of the spectral dimension. For details on the dataset, see Sec. V. We transform the dataset into a matrix of size 6,000 x 58 and incrementally increase the number of bands, d, from 1 to 58 following the wavelength order. At each d, we feed the data to the k-means clustering algorithm and then use the resulting centroids and pixel memberships to calculate the cluster-conditional means and covariance matrices, from which we compute CIs between pairs of cluster-conditional distributions. We then take the minimum CI over all pairs, which represents the largest possible misclassification error, and plot it against the number of bands used in clustering, as shown in Fig. 2. Here we set the number of clusters, k, to 5.

[Fig. 2. Minimum Chernoff information between pairs of clusters by k-means versus data dimension (number of bands).]

From Fig. 2, we observe a generally increasing trend, but also ups-and-downs in the middle, which appears to contradict Proposition 2. Investigating this problem, we find that the empirical covariances of the subspaces change when a new band is added, i.e., \tilde{\Sigma}_{1:d,1:d} \ne \Sigma, so the assumption in Eq. (24) is violated. Here, \tilde{\Sigma}_{1:d,1:d} denotes the first d columns and first d rows of \tilde{\Sigma}. The only possible reason for this is that the cluster memberships of some pixels switch after a new band is added. Figure 3 shows the number of pixels that switched their class membership after each new band was added. In the worst case, close to 30% of the pixels change membership, as shown by the peak. Clearly, even though correct classification may be achievable in lower dimensions, adding more data can alter these classifications and lead to wrong conclusions, and the membership-switching problem presents a challenge in clustering high-dimensional data that is worth further investigation.

[Fig. 3. The number of pixels that switched membership when a new dimension is added.]

As claimed by Proposition 1, adding an independent variable does not decrease the Chernoff information, and because PCA provides us with a space spanned by orthogonal principal components, namely the principal component (PC) space, we can test the trend of the CI after incrementally adding PCs.
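The incremental-band experiment behind Fig. 2 can be sketched as follows, assuming the chernoff_info_gauss routine from the earlier sketch is in scope; the synthetic stand-in cube, the scikit-learn k-means settings, and the small ridge added to each covariance are our illustrative choices, not the authors' exact implementation. The same loop applies to the PC or variance orderings by transforming or permuting the columns of the data matrix first.

    from itertools import combinations
    import numpy as np
    from sklearn.cluster import KMeans
    # chernoff_info_gauss as defined in the earlier sketch

    def min_pairwise_ci(X, n_clusters=5, seed=0):
        """k-means cluster X (pixels x features), fit a Gaussian to each cluster,
        and return the minimum pairwise Chernoff information over cluster pairs."""
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        params = []
        for c in range(n_clusters):
            Xc = X[labels == c]
            cov = np.atleast_2d(np.cov(Xc, rowvar=False)) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
            params.append((Xc.mean(axis=0), cov))
        return min(chernoff_info_gauss(m1, S1, m2, S2)
                   for (m1, S1), (m2, S2) in combinations(params, 2))

    # Incrementally add bands in wavelength order (cf. Fig. 2, k = 5 clusters).
    rng = np.random.default_rng(0)
    cube = rng.normal(size=(60, 100, 58))          # stand-in for the 60 x 100 x 58 HSI cube
    X_all = cube.reshape(-1, cube.shape[-1])
    ci_curve = [min_pairwise_ci(X_all[:, :d]) for d in range(1, X_all.shape[1] + 1)]

With real data, membership switching between consecutive values of d can be detected by comparing the label vectors returned at d and d+1, which is how a curve like Fig. 3 would be produced.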
Another concern in high-dimensional data clustering is that the data variances at different bands can also alter clustering results; see, e.g., [45], [46]. For example, adding a large-variance variable can easily alter sample memberships, as illustrated in Fig. 4, where point A in the one-dimensional case, on the horizontal axis, would be classified as the red class, but in the two-dimensional case it would be classified as the blue class. The two centroids, C1 and C2, are more separated on the vertical axis, indicating a larger overall variance in this second dimension. For this reason, it might be advisable to order features by their variances, which is also tested here. Similarly, one could argue that normalizing each band by its variance could help avoid the membership switching. All three cases are tested and shown in Fig. 5. The PCA approach clearly enjoys a monotonically increasing trend after incrementally adding PCs, while ordering by variance also results in a monotonically increasing trend, although it rises more slowly than the PCA-based CI sequence in the beginning and shows a sharp jump at the 35th ordered variable. Disappointingly, normalizing the bands does not remove the ups-and-downs, as seen on the black CI curve.

[Fig. 4. An example of membership switching after adding an extra dimension.]

[Fig. 5. Comparison of the minimum CIs with different orderings of the data (PCA, original, ordered by variances, normalized).]

IV. FEATURE SELECTION WITH CHERNOFF INFORMATION

Though PCA enjoys a clear advantage given the orthogonality of the PC space, researchers often would like to classify in the original feature space, which is more physically or chemically meaningful. For example, it is hard to physically interpret PC vectors, which generally contain negative entries, in HSI or LiDAR data. In the original feature space, besides ranking features by their variances, we can also rank them by various importance measures, as noted in the introduction, e.g., the excess-mass measure [23], [24], [25] or the BCSS used in k-means. For example, given an HSI dataset, rather than following the wavelength order, we can rank and choose the most important bands before applying clustering algorithms.

Here, we introduce a hill-climbing scheme that incrementally searches for and adds the most information-rich set of features, while guaranteeing no loss of information in terms of classification. The end result is a set of chosen features, which can then be reordered by, for example, wavelength for algorithms such as spectral unmixing. Starting from an empty set \hat{X}, at each iteration the algorithm searches for the variable X_i that provides the largest minimum CI between all pairs of clusters, i.e.,

    \hat{i} = \arg\max_i \min_{j \ne k} CI(\hat{f}_j, \hat{f}_k),    (25)

where \hat{f}_k is the empirical distribution of the k-th cluster estimated from the data \hat{X} concatenated with X_i. The algorithm then adds X_{\hat{i}} to \hat{X}. Denote the maximum CI at the j-th iteration as c_j and the sample membership set as M_j. To guarantee a nondecreasing sequence of CIs, if c_j < c_{j-1}, we invoke Proposition 2 by fixing the memberships to those of the previous iteration, i.e., M_j = M_{j-1}, and recompute the CIs between all pairs before finding \hat{i}. The pseudo-code is given in Algorithm 2.

Input: d-dimensional data X.
Output: Reordered d-dimensional data \hat{X} and a CI sequence.
1   Initialize an index set S = {1, 2, ..., d}.
2   \hat{X} = \emptyset, j = 1.
3   while S \ne \emptyset do
4       for i \in S do
5           X^i = concatenate(\hat{X}, X_i).
6           Cluster X^i using, e.g., k-means, and assign the memberships to M_j.
7           Estimate \Sigma_k and \mu_k for each cluster.
8           Compute CI_{ijk} between all pairs (j, k) of clusters.
9       end
10      (\hat{i}, \hat{j}, \hat{k}) = \arg\max_i \min_{j \ne k} CI_{ijk}.
11      c_j = CI_{\hat{i}\hat{j}\hat{k}}.
12      if c_j < c_{j-1} then
13          M_j = M_{j-1}.
14          Repeat steps 4 to 11.
15      end
16      \hat{X} = concatenate(\hat{X}, X_{\hat{i}}).
17      S = S \ {\hat{i}}.
18      j = j + 1.
19  end
Algorithm 2: A hill-climbing algorithm for reordering features.
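A compact Python sketch of the greedy loop in Algorithm 2 is given below, reusing chernoff_info_gauss and the imports from the previous sketches; the function and variable names are ours, and the membership-fixing branch is a simplified rendering of steps 12 to 15 under the Gaussian cluster model.

    import numpy as np
    from itertools import combinations
    from sklearn.cluster import KMeans
    # chernoff_info_gauss as defined in the earlier sketch

    def hill_climb_order(X, n_clusters=5):
        """Greedy reordering of features by Chernoff information (Algorithm 2 sketch).
        X: (n_samples, d) array. Returns the selected feature order and the CI sequence."""
        d = X.shape[1]
        remaining = list(range(d))
        chosen, ci_seq = [], []
        prev_labels, prev_ci = None, -np.inf

        def min_ci(cols, labels):
            # Fit a Gaussian to each cluster on the chosen columns and take the
            # minimum pairwise Chernoff information.
            params = []
            for c in range(n_clusters):
                Xc = X[labels == c][:, cols]
                cov = np.atleast_2d(np.cov(Xc, rowvar=False)) + 1e-6 * np.eye(len(cols))
                params.append((Xc.mean(axis=0), cov))
            return min(chernoff_info_gauss(m1, S1, m2, S2)
                       for (m1, S1), (m2, S2) in combinations(params, 2))

        while remaining:
            scores = {}
            for i in remaining:
                cols = chosen + [i]
                labels = KMeans(n_clusters, n_init=10, random_state=0).fit_predict(X[:, cols])
                scores[i] = (min_ci(cols, labels), labels)
            best = max(scores, key=lambda i: scores[i][0])
            ci, labels = scores[best]
            if ci < prev_ci and prev_labels is not None:
                # Fix memberships to the previous iteration (Proposition 2) and rescore.
                scores = {i: min_ci(chosen + [i], prev_labels) for i in remaining}
                best = max(scores, key=scores.get)
                ci, labels = scores[best], prev_labels
            chosen.append(best)
            remaining.remove(best)
            ci_seq.append(ci)
            prev_ci, prev_labels = ci, labels
        return chosen, ci_seq

The full search re-runs k-means on the order of d^2 times, so in practice one might cache clusterings or subsample pixels; the experiments in Sec. V use k = 5 clusters.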
Compared to the excess-mass measure or the BCSS, the Chernoff information is more flexible with respect to model assumptions, because it allows a wide variety of exponential-family distributions, while the excess-mass measure usually requires continuous data and the BCSS assumes homoscedasticity, i.e., equal variances. It also helps in choosing the number of clusters, since the minimum CI between all pairs of clusters dwindles as the number of clusters increases. The membership-fixing idea also ensures no loss of information or features, especially when noisy or highly correlated data are added that could potentially alter an otherwise correct classification.

V. EXPERIMENTS

The HSI and LiDAR datasets were simultaneously acquired by sensors on the same aircraft over Gulfport, Mississippi [9] in 2010. The ground sampling distance is 1 meter. The original HSI dataset has 72 bands ranging from 0.4 µm to 1.0 µm. After removing noisy bands, we use 58 bands on a 60 x 100 scene, and the coregistered LiDAR point cloud is preprocessed into implicit geometry (IG) functions, namely the population function, the validity function, and the distance function, each containing 12 features. For details of the IG functions, see [47]. The scene, shown in Fig. 6, contains parking lots, buildings, vehicles, grass, and trees.

[Fig. 6. Top: a false-color image of an HSI band at 0.72 µm. Middle: the height profile of the scene from LiDAR. Bottom: a 3D aerial view of the scene from Google Maps (Imagery @2014 Google, Map data @2014 Google).]

The fusion of HSI and LiDAR data considered next is at the feature level, meaning we simply concatenate the two datasets, for a total of 70 features. The purposes of the following experiments are: first, to determine whether concatenating HSI data with LiDAR increases the classification accuracy; second, to compare the hill-climbing approach with three other ordering schemes, namely the natural wavelength order, the PC order, and the variance order; and third, to determine which of the three IG functions performs best. The experiments are set up so that for each combination of HSI and LiDAR IG, i.e., HSI only, HSI + Validity, HSI + Distance, and HSI + Population, we compute the CI sequences for four different orders: the wavelength order, the PC order, the variance order, and the hill-climbing order. We use the k-means clustering algorithm and set the number of clusters to 5.
We derive the empirical cluster-conditional means and covariances from the pixel memberships given by k-means, before transforming them into natural parameters and computing the statistical distances. Each subplot in Fig. 7 shows four curves of CIs, corresponding to the four different combinations of HSI and LiDAR features. From the ups-and-downs in all four HSI-only curves, we can see that adding features in the wavelength order can reduce the CIs. Adding LiDAR features does increase the CIs, i.e., it reduces the misclassification errors. The PC order mostly enjoys an increasing sequence of CIs as more PCs are added, though we observe occasional small fluctuations, possibly due to the sensitivity of the k-means algorithm to initial conditions. Ordering features by their variances does not guarantee nondecreasing sequences of CIs, as we observe some large drops in the green curves. Finally, the hill-climbing approach not only enjoys nondecreasing sequences of CIs, it also attains the largest CIs in the end when all features are included. The largest increase in CI is observed for HSI combined with the Distance function.

[Fig. 7. The CIs for increasing numbers of features in four different orders (natural, PC, variance, hill-climbing).]

With the same experimental setup, we also computed the RADs, shown in Fig. 8, whose exponential provides lower bounds on the misclassification error. As expected, they have greater values than the CIs and mostly follow the same trends, though in the hill-climbing approach they can have occasional drops. Note that the hill-climbing algorithm can also be based on the RADs rather than the CIs, which would provide a nondecreasing sequence of RADs; however, we cannot guarantee nondecreasing sequences for both the RAD and the CI simultaneously. For the misclassification errors, we transform the CIs and RADs into lower and upper bounds on E, via

    e^{-R(f_1, f_2)} \le E \le e^{-C^*(f_1, f_2)},

and plot them in Fig. 9. More directly, we can see that the combination of HSI with the Distance function has the quickest drop in error as more features are added, and for a given tolerance level this combination needs the smallest number of features. For example, with only 10 features, HSI + Distance guarantees a misclassification error below 0.01.

[Fig. 8. The RADs for increasing numbers of features in four different orders (natural, PC, variance, hill-climbing).]

VI. CONCLUSIONS AND DISCUSSIONS

In this study we have introduced both upper and lower bounds on misclassification errors via information-theoretic statistical measures, which help us identify and collect features for the best classification accuracy without reliance on any particular classification algorithm. Recent studies of the Chernoff information allow us to compute it quickly with a geodesic-bisection algorithm for the exponential family of distributions, which is widely assumed in analyzing hyperspectral and LiDAR data. We have shown that, theoretically, adding more features should not reduce the Chernoff information, or the classification accuracy, but empirically, on sample data,
adding more features does not guarantee that Chernoff information will not be lost. Facing this problem, we have devised a hill-climbing algorithm that reorders the features based on their contribution to the Chernoff information, and hence guarantees a nondecreasing sequence of CIs as more features are added. The proposed approach does not remove any features, in contrast to band-selection methods, and it is also flexible in the choice of the underlying distributions of the clustering models.

[Fig. 9. The misclassification error bounds (RAD and CI) for features selected by the hill-climbing algorithm, for HSI, HSI + Validity, HSI + Distance, and HSI + Population.]

We have learned several things from this study. First, there are not necessarily "bad bands" or "good bands" in all hyperspectral images, a notion that often brings to mind data-quality measures such as the signal-to-noise ratio (SNR). For example, in Fig. 4, the data on both the x-axis and the y-axis appear to show a fairly good two-cluster pattern, yet adding the second dimension can still cause membership switching for certain points. Only in the more extreme cases, when the SNRs of certain bands are so low or the data are so noisy that they are almost random, should we consider calling these "bad bands" and remove them.

Second, because of the dependence of the CI on the assumed statistical models, the determination of information-rich bands is model-dependent, meaning that a "good" band for one model might be "bad" for another. Hence, analysts are encouraged to have a good understanding of the underlying statistical-model assumptions before applying a given band-selection method.

Third, as the model complexity increases, e.g., as the number of clusters in k-means increases, the membership switching can become more severe, and hence we might expect more ups-and-downs in the CIs after incrementally adding features; see, e.g., the comparison of the CIs of the 3-cluster model with those of the 5-cluster model in Fig. 10. In these cases, we need to be more careful in choosing the features to use. Also, membership switching does not only happen with the more rudimentary clustering algorithms, such as k-means; as clusters become less separated and the number of clusters increases, we can observe similar behavior with more sophisticated algorithms.

[Fig. 10. Comparison of the CIs of the 3-cluster model and the CIs of the 5-cluster model.]

Fourth, it may not be good practice to incrementally add spectral bands following the natural wavelength order, as shown in Fig. 2. In our experiments, the wavelength order shows poor CI sequences. We suggest that analysts consider using feature-selection techniques, the CI approach being one of them, to rank and reorder the bands before applying their data analysis methods. However, for physical interpretation of clustered data, the information-rich bands selected by the hill-climbing algorithm should be reordered back into the wavelength order. For example, if the indices of the top five spectral bands ranked by the hill-climbing algorithm were {30, 27, 3, 50, 10}, analysts should reorder them as {3, 10, 27, 30, 50} before applying clustering algorithms, for better interpretation of the clustering results.
ACKNOWLEDGMENT

This work is sponsored by National Geospatial-Intelligence Agency (NGA) contract HM158210-C-0011 and by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-11-1-0194.

REFERENCES

[1] A. P. Cracknell and L. W. Hayes, Introduction to Remote Sensing, 2nd ed. London: Taylor and Francis, 2007.
[2] A. D. Stocker, I. S. Reed, and X. Yu, "Multidimensional signal processing for electro-optical target detection," in OE/LASE'90, 14-19 Jan., Los Angeles, CA. International Society for Optics and Photonics, 1990, pp. 218-231.
[3] C. Funk, J. Theiler, D. Roberts, and C. Borel, "Clustering to improve matched filter detection of weak gas plumes in hyperspectral thermal imagery," Geoscience and Remote Sensing, IEEE Transactions on, vol. 39, no. 7, pp. 1410-1420, 2001.
[4] I. Jolliffe, Principal Component Analysis. Wiley Online Library, 2005.
[5] V. Pauca, J. Piper, and R. Plemmons, "Nonnegative matrix factorization for spectral data analysis," Linear Algebra and its Applications, vol. 416, no. 1, pp. 29-47, 2006.
[6] L. Miao and H. Qi, "Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization," Geoscience and Remote Sensing, IEEE Transactions on, vol. 45, no. 3, pp. 765-777, 2007.
[7] X. Liu, W. Xia, B. Wang, and L. Zhang, "An approach based on constrained nonnegative matrix factorization to unmix hyperspectral data," Geoscience and Remote Sensing, IEEE Transactions on, vol. 49, no. 2, pp. 757-772, 2011.
[8] T. Tu, "Unsupervised signature extraction and separation in hyperspectral images: A noise-adjusted fast independent component analysis approach," Optical Engineering, vol. 39, p. 897, 2000.
[9] J. Wang and C. Chang, "Applications of independent component analysis in endmember extraction and abundance quantification for hyperspectral imagery," Geoscience and Remote Sensing, IEEE Transactions on, vol. 44, no. 9, pp. 2601-2616, 2006.
[10] J. Gualtieri and R. Cromp, "Support vector machines for hyperspectral remote sensing classification," in Proceedings of SPIE - The International Society for Optical Engineering, vol. 3584. Citeseer, 1999, pp. 221-232.
[11] S. Chiang, C. Chang, and I. Ginsberg, "Unsupervised target detection in hyperspectral images using projection pursuit," Geoscience and Remote Sensing, IEEE Transactions on, vol. 39, no. 7, pp. 1380-1391, 2002.
[12] J. Tilton and W. Lawrence, "Interactive analysis of hierarchical image segmentation," in Geoscience and Remote Sensing Symposium, 2000. Proceedings. IGARSS 2000. IEEE 2000 International, vol. 2. IEEE, 2002, pp. 733-735.
[13] G. P. Asner, D. E. Knapp, T. Kennedy-Bowdoin, M. O. Jones, R. E. Martin, J. Boardman, and C. B. Field, "Carnegie Airborne Observatory: in-flight fusion of hyperspectral imaging and waveform light detection and ranging for three-dimensional studies of ecosystems," Journal of Applied Remote Sensing, vol. 1, no. 1, pp. 013536-013536, 2007.
[14] P. Gader, A. Zare, R. Close, and G. Tuell, "Co-registered hyperspectral and LiDAR Long Beach, Mississippi data collection," University of Florida, University of Missouri, and Optech International, 2010.
[15] C. Chang, Hyperspectral Data Exploitation: Theory and Applications. Wiley-Interscience, 2007.
[16] M. Dalponte, L. Bruzzone, and D. Gianelle, "Fusion of hyperspectral and lidar remote sensing data for classification of complex forest areas," Geoscience and Remote Sensing, IEEE Transactions on, vol. 46, no. 5, pp. 1416-1427, 2008.
[17] M. T. Eismann, Hyperspectral Remote Sensing. SPIE Press, 2012.
[18] J. Schott, Remote Sensing: The Image Chain Approach. Oxford University Press, USA, 2007.
[19] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217-235.
[20] F. Nielsen, "Chernoff information of exponential families," arXiv preprint arXiv:1102.2684, 2011.
[21] D. H. Johnson and S. Sinanovic, "Symmetrizing the Kullback-Leibler distance," IEEE Transactions on Information Theory, vol. 1, no. 1, pp. 1-10, 2001.
[22] J. A. Benediktsson, J. R. Sveinsson, and K. Arnason, "Classification and feature extraction of AVIRIS data," Geoscience and Remote Sensing, IEEE Transactions on, vol. 33, no. 5, pp. 1194-1205, 1995.
[23] D. W. Müller and G. Sawitzki, "Excess mass estimates and tests for multimodality," Journal of the American Statistical Association, vol. 86, no. 415, pp. 738-746, 1991.
[24] D. Nolan, "The excess-mass ellipsoid," Journal of Multivariate Analysis, vol. 39, no. 2, pp. 348-371, 1991.
[25] Y.-b. Chan and P. Hall, "Using evidence of mixed populations to select variables for clustering very high-dimensional data," Journal of the American Statistical Association, vol. 105, no. 490, 2010.
[26] D. M. Witten and R. Tibshirani, "A framework for feature selection in clustering," Journal of the American Statistical Association, vol. 105, no. 490, 2010.
[27] A. E. Raftery and N. Dean, "Variable selection for model-based clustering," Journal of the American Statistical Association, vol. 101, no. 473, pp. 168-178, 2006.
[28] A. Martinez-Uso, F. Pla, J. Sotoca, and P. Garcia-Sevilla, "Clustering-based hyperspectral band selection using information measures," Geoscience and Remote Sensing, IEEE Transactions on, vol. 45, no. 12, pp. 4158-4171, 2007.
[29] J. M. Sotoca and F. Pla, "Hyperspectral data selection from mutual information between image bands," in Structural, Syntactic, and Statistical Pattern Recognition. Springer, 2006, pp. 853-861.
[30] P. Bajcsy and P. Groves, "Methodology for hyperspectral band selection," Photogrammetric Engineering and Remote Sensing, vol. 70, pp. 793-802, 2004.
[31] J. C. Russ, The Image Processing Handbook. CRC Press, 2006.
[32] J. C. Price, "Band selection procedure for multispectral scanners," Applied Optics, vol. 33, no. 15, pp. 3281-3288, 1994.
[33] J. B. Campbell, Introduction to Remote Sensing. Taylor & Francis, 2002.
[34] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[35] J.-F. Chamberland and V. V. Veeravalli, "Decentralized detection in sensor networks," Signal Processing, IEEE Transactions on, vol. 51, no. 2, pp. 407-416, 2003.
[36] F. Calderero and F. Marques, "Region merging techniques using information theory statistical measures," Image Processing, IEEE Transactions on, vol. 19, no. 6, pp. 1567-1586, 2010.
[37] F. A. Sadjadi, "Performance evaluations of correlations of digital images using different separability measures," Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 4, pp. 436-441, 1982.
[38] S. K. Zhou and R. Chellappa, "Beyond one still image: Face recognition from multiple still images or a video sequence," in Face Processing: Advanced Modeling and Methods. Citeseer, 2005, pp. 547-567.
[39] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu, "Statistical edge detection: Learning and evaluating edge cues," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 1, pp. 57-74, 2003.
[40] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[41] H. Avi-Itzhak and T. Diep, "Arbitrarily tight upper and lower bounds on the Bayesian probability of error," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, no. 1, pp. 89-91, 1996.
[42] J. C. Harsanyi and C.-I. Chang, "Hyperspectral image classification and dimensionality reduction: an orthogonal subspace projection approach," Geoscience and Remote Sensing, IEEE Transactions on, vol. 32, no. 4, pp. 779-785, 1994.
[43] D. G. Manolakis, G. A. Shaw, and N. Keshava, "Comparative analysis of hyperspectral adaptive matched filter detectors," in Proceedings of SPIE - The International Society for Optical Engineering, 2000, pp. 2-17.
[44] D. Manolakis and G. Shaw, "Detection algorithms for hyperspectral imaging applications," Signal Processing Magazine, IEEE, vol. 19, no. 1, pp. 29-43, 2002.
[45] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, "Fast algorithms for projected clustering," in ACM SIGMOD Record, vol. 28, no. 2. ACM, 1999, pp. 61-72.
[46] C. Bohm, K. Kailing, H.-P. Kriegel, and P. Kroger, "Density connected clustering with local subspace preferences," in Data Mining, 2004. ICDM'04. Fourth IEEE International Conference on. IEEE, 2004, pp. 27-34.
[47] D. Nikic, J. Wu, V. Pauca, R. Plemmons, and Q. Zhang, "A novel approach to environment reconstruction in LiDAR and HSI datasets," in Advanced Maui Optical and Space Surveillance Technologies Conference, 2012.