Information-Theoretic Feature Selection for Classification: Applications to Fusion of Hyperspectral and LiDAR Data
Qiang Zhang, Victor P. Pauca, Robert J. Plemmons, Robert Rand, and Todd C. Torgersen
Abstract
In an era of massive data gathering, analysts face an increasingly difficult decision regarding
the number of data features to consider for the purposes of target detection and classification. While more
data may offer additional information, at the cost of a potentially prohibitive increase in computation time,
it is unclear whether classification accuracy increases accordingly. In this study, we address this question
from an information-theoretic standpoint, providing upper and lower bounds on the misclassification error,
using the Chernoff information (CI) for the upper bound and the resistor-average distance for the lower bound.
Our results are applied to the problem of classification from the fusion of real hyperspectral and LiDAR
data, using numerical experiments and analysis. The results show that, in principle, adding extra data
features does not decrease the CI, or equivalently the classification accuracy. Empirically, however, this
is not the case, and our results show that the indiscriminate inclusion of more data can in fact lead to
reduced classification accuracy. We offer a hill-climbing algorithm to incrementally rank and reorder
features in such a way that adding an extra feature to any first k features does not lead to information
loss or, equivalently, reduction in classification accuracy.
Q. Zhang is with the Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, North Carolina.
Victor P. Pauca is with the Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina.
Robert J. Plemmons is with the Departments of Computer Science and Mathematics, Wake Forest University, Winston-Salem,
North Carolina.
Robert Rand is with the National Geospatial-Intelligence Agency.
Todd C. Torgersen is with the Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina.
Index Terms
Chernoff information, feature selection, high-dimensional data classification, fusion of hyperspectral
and LiDAR data, remote sensing.
I. INTRODUCTION
The science of remote sensing for the environment and other applications in its broadest
sense includes general aerial images, satellite, and spacecraft observations of parts of the earth.
Examples of sensing modalities include, e.g., hyperspectral imaging (HSI) and Light Detection
and Ranging (LiDAR) data. HSI remote sensing technology allows one to capture images using a
range of spectra from ultraviolet to visible to infrared. Hundreds of images of a scene or object
are created using light from different parts of the spectrum. These hyperspectral images can
be used, for example, to detect objects at a distance, to identify surface minerals, objects, and
buildings from space, and to identify space objects from the ground, leading to useful classification of data
features. LiDAR is a remote sensing technology that measures distance by illuminating a target
with a laser and analyzing the reflected light. Each LiDAR point has a position and color in
an (x, y, z, R, G, B) format, along with range information. Its wide applications include geography,
archaeology, geology, forestry, remote sensing, and atmospheric physics [1].
A tremendously rich literature exists on material identification, clustering, and spectral unmixing of HSI data, including the matched filter and related methods [2], [3], principal component analysis
(PCA) [4], nonnegative matrix factorization (NMF) [5], [6], [7], independent component analysis
(ICA) [8], [9], support vector machine (SVM) [10], projection pursuit [11] and hierarchical
clustering models [12]. In addition, recent advances in sensor development have allowed for both
HSI and LiDAR data to be captured simultaneously [13], [14], thus providing both geometrical as
well as spectral information for a given scene. A typical dataset collected in this way can contain
millions of LiDAR points and hundreds of gigabytes of HSI imagery; see, e.g., [15], [16], [17], [18].
The fusion of multi-modality data, e.g., HSI and LiDAR, is an important, but challenging, aspect
of remote sensing. The huge amounts of data provided by conventional and multi-modal sensing
platforms present a significant challenge for the development of effective computational
techniques for processing, storage, transmission, and analysis of such data.
Facing these large amounts of data, analysts often have to make a first decision on whether
to use the entire dataset or a subset for a given classification algorithm. Quite often the dataset
becomes so large that it is computationally prohibitive to use the whole dataset, and thus
the question becomes which subset to choose. A related question is whether hundreds of HSI
images plus millions of LiDAR points would guarantee better identification or classification of
objects. The curse of dimensionality is a well-known phenomenon, described by the following limit [19]:
\lim_{p \to \infty} \frac{d_{\max} - d_{\min}}{d_{\min}} = 0,    (1)
where p is the number of dimensions, and d_max and d_min are, respectively, the maximum and the
minimum distances used to differentiate two objects. In other words, it becomes harder and harder to
differentiate two objects as the data dimension increases. Clearly, more is not necessarily better,
but could it be worse?
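To see this effect numerically, the following small synthetic sketch (an illustration added for concreteness; the sample size, random seed, and use of standard normal data are arbitrary choices) estimates the relative contrast in Eq. (1) for increasing dimension p:

import numpy as np

rng = np.random.default_rng(0)
for p in (2, 10, 100, 1000):
    X = rng.standard_normal((1000, p))          # 1000 random points in p dimensions
    d = np.linalg.norm(X[1:] - X[0], axis=1)    # distances from the first point to all others
    print(p, (d.max() - d.min()) / d.min())     # relative contrast shrinks as p grows

As p grows, the printed ratio approaches zero, which is exactly the behavior described by Eq. (1).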
Using the misclassification error as the criterion for evaluating whether more data would be
better or worse for classification, we do not rely on any specific classification algorithms, such
as PCA or SVM, but establish both an upper bound of the misclassification error, namely the
Chernoff information [20], and a lower bound, namely the resistor-average distance (RAD) [21],
a symmetrized Kullback-Leibler distance. This approach avoids empirically comparing different
matched filters or clustering algorithms, some of which excel on one dataset but fall behind
on others, while the bounds allow us to evaluate which subspaces would be more appropriate for
classifying objects.
In this study we demonstrate that, theoretically, adding more data will not decrease the
classification accuracy, but empirically this is not necessarily true. For a real hyperspectral dataset,
we have shown that incrementally adding bands from shorter to longer wavelengths can at some
point reduce the Chernoff information, meaning that adding more data could instead reduce the
classification accuracy, as also observed in [22]. This raises a caution about approaches that arbitrarily
add extra dimensions of data, e.g., concatenating data from different modalities such as HSI
and LiDAR. On the other hand, the Chernoff information may also provide us an understanding
of the most information-rich dimensions or bands, e.g., a subset of wavelengths that would best
differentiate one material from others.
In the literature of variable selection for high-dimensional clustering, or equivalently the band
selection in hyperspectral data analysis, various importance measures have been proposed, e.g.,
density-based methods such as the excess mass test for multimodality [23], [24], [25], criterion-based methods such as sparse k-means [26] using the Between-Cluster Sum of Squares (BCSS), or
model-based methods such as the Bayesian Information Criterion [27]. Band selection is also
an important research topic in hyperspectral imaging, where we see information-theoretic
approaches such as [28], [29], and for more comprehensive comparisons, see, e.g., [30]. In the
following discussions, by a feature we mean a variable that is either a spectral band or a LiDAR
variable for each spatial pixel.
The differences between our approach and variable-selection, also known as band-selection,
methods in the hyperspectral literature lie in several aspects. First, rather than evaluating each
feature separately, such as by the information entropy and the contrast measure [31], or in a
small neighborhood of features, such as by the first and second spectral derivatives [32], the
spectral ratio measure [33], and the correlation measure [34], the upper and lower bounds of the
misclassification error are computed using a subset of features. A hill-climbing algorithm using
the upper bound ranks and adds features incrementally from the most information-rich to the
least, and each new feature’s information is evaluated conditional upon the set of already chosen
features.
Second, the information bounds are independent of any particular classification algorithm or
dataset, while guaranteeing that no classification algorithm will perform better or worse than
these bounds. Our proposed method is much easier to compute and can be applied to not
only hyperspectral data but also LiDAR, while some of the band-selection methods are devised
specifically for hyperspectral data. For example, the spectral-derivative methods depend on the
smoothness of neighboring-band variations that may not be applicable to LiDAR, which often
shows abrupt changes.
Third, unlike the fixed measures such as the information entropy or the spectral derivatives,
the ranks of the features can vary depending upon the underlying statistical models assumed by
classifiers. This third aspect deserves particular attention from analysts: rather than feeding
a chosen subset of data to a classifier without knowing its underlying statistical assumptions, we
should be aware of these assumptions to the extent that information-theoretic bounds can
be established to validate the empirical results, which can vary significantly from one dataset to
another.
The structure of the paper is as follows. In Sec. II, we introduce the definition of Chernoff
information and its fast computation, specifically for the exponential family of distributions.
Some empirical results are presented here to illustrate further developments. In Sec. III and
IV, we discuss the feature selection method based on Chernoff information that employs all the
bands while guaranteeing a nondecreasing Chernoff information sequence. In Sec. V, the method
is illustrated through the problem of fusing real HSI and LiDAR datasets. We then conclude in
Sec. VI with discussions.
II. CHERNOFF INFORMATION
The expected cost of misclassification (E), also called the Bayesian error (risk), is formulated as

E = \sum_i p_i \sum_{k \neq i} P(k \mid i),    (2)
where p_i is the prior probability of class (object/target) i, and P(k|i) is the probability of classifying
class i as k. E is minimized if an observed vector x in population k satisfies

p_k f_k(x) > p_i f_i(x), \quad i \neq k,    (3)

where f_i(x) is the distribution of class i. If we assume a uniform prior distribution, then the
classification rule can be stated as: find k such that

f_k(x) > f_i(x), \quad i \neq k.    (4)
Consider a univariate two-class case in which each class has a normal distribution, as shown
in Fig. 1, where f_i, i = 1, 2, is a normal distribution with mean µ_i, and the decision region is
(-\infty, \frac{\mu_1 + \mu_2}{2}) for class 1 and (\frac{\mu_1 + \mu_2}{2}, +\infty) for class 2. Moving the cut point to the left or right
of (µ_1 + µ_2)/2 increases E, shown as the shaded area in Fig. 1.
To establish an upper bound on the classification error, we use the two-class case as an example, where

E = \int_x \min(p_1 f_1, p_2 f_2) \, dx.    (5)

Fig. 1. Classification error of a univariate two-class case.
Using the following relationship,

\min(a, b) \leq a^{\alpha} b^{1-\alpha}, \quad \forall \alpha \in [0, 1],    (6)

we obtain a first bound on E as

E \leq p_1^{\alpha} p_2^{1-\alpha} \int_x f_1^{\alpha}(x) f_2^{1-\alpha}(x) \, dx.    (7)

Define the Chernoff α-divergence as

C_{\alpha}(f_1 : f_2) = -\log \int_x f_1^{\alpha}(x) f_2^{1-\alpha}(x) \, dx,    (8)

and the Chernoff information as

C^{*}(f_1 : f_2) = \max_{\alpha \in [0,1]} C_{\alpha}(f_1 : f_2).    (9)

Then we have

E \leq p_1^{\alpha^*} p_2^{1-\alpha^*} \exp[-C^{*}(f_1 : f_2)],    (10)

and for the uniform prior distribution case, we simply have

E \leq \exp[-C^{*}(f_1 : f_2)].    (11)
Clearly, a higher C^* means a lower misclassification error and hence is more desirable. From
the definitions of the Chernoff α-divergence and the Chernoff information, we know they are both
nonnegative and asymmetric, i.e., C_α(f_1 : f_2) ≠ C_α(f_2 : f_1), but C_α(f_2 : f_1) = C_{1-α}(f_1 : f_2).
Applications of Chernoff information include, e.g., sensor networks [35], image segmentation
and registration [36], [37], face recognition [38] and edge segmentation [39]. Next, we look at
some basic properties of Chernoff information.
For multivariate distributions, if one variable is independent of the rest, the
Chernoff α-divergence, and hence the Chernoff information, of all variables is decomposable. This is
stated in the following proposition.
Proposition 1. Let f_1 and f_2 be two joint distributions of a d-variate random vector x =
(x_1, ..., x_d) and a random variable y. If y is independent of x, then

C_{\alpha}(f_1, f_2) = C_{\alpha}(f_{1x}, f_{2x}) + C_{\alpha}(f_{1y}, f_{2y}),    (12)

where f_{1x} and f_{2x} are the marginal distributions of x, and f_{1y} and f_{2y} are the marginal
distributions of y.
The proof follows simply from the decomposition of the distribution functions f_1 and f_2 under
the independence assumption, and is hence omitted here. Proposition 1 tells us that adding
independent variables for classification would never reduce the Chernoff information or, equivalently,
increase the upper bound on the misclassification error. If the variable y is not independent of x, the Chernoff
α-divergence between f_1 and f_2 is still no less than that between f_{1x} and f_{2x}.
Proposition 2. If y is not independent of x, then

C_{\alpha}(f_1, f_2) \geq C_{\alpha}(f_{1x}, f_{2x}).    (13)
Proof: By the conditional decomposition f_i(x, y) = f_i(y|x) f_{ix}(x), i = 1, 2, we have

\int_{\Omega_x} \int_{\Omega_y} f_1^{\alpha}(x, y) f_2^{1-\alpha}(x, y) \, dy \, dx = \int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x) \left[ \int_{\Omega_y} f_1^{\alpha}(y|x) f_2^{1-\alpha}(y|x) \, dy \right] dx.    (14)

Denote g_α(x) = \int_{\Omega_y} f_1^{\alpha}(y|x) f_2^{1-\alpha}(y|x) \, dy, and notice that for all x and all α, 0 < g_α(x) < 1. Hence, the
integral above becomes

\int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x) g_{\alpha}(x) \, dx \leq \int_{\Omega_x} f_{1x}^{\alpha}(x) f_{2x}^{1-\alpha}(x) \, dx.    (15)

Hence, C_α(f_1, f_2) ≥ C_α(f_{1x}, f_{2x}) by definition.
Propositions 1 and 2 confirm that adding a variable, whether independent or not,
should not reduce the Chernoff information or, equivalently, increase the bound on the misclassification error.
There is generally no closed form for the Chernoff information, except in special cases such as the
1-parameter exponential family. For example, the Chernoff information between two univariate Gaussian distributions with a common standard deviation σ is

C^{*}(f_1 : f_2) = \frac{(\mu_1 - \mu_2)^2}{8\sigma^2}.    (16)
However, if we assume the class distributions belong to the exponential family, the relationships
among the Chernoff α-divergence, the Jensen α-divergence, and the Bregman divergence lead to a
simple algorithm for computing the CI [20]. The exponential family of distributions can be written
as

p(x) = h(x) \, e^{\langle \theta, f(x) \rangle - F(\theta)},    (17)

where θ is the natural parameter vector, f(x) is the sufficient statistics vector, F(θ) is the
cumulant generating function, and h is simply a normalizing function. For example, the
multivariate normal distribution N(µ, Σ) has θ = (Σ^{-1}µ, -\frac{1}{2}Σ^{-1}), f(x) = (x, xx^T), and
F(θ) = \frac{1}{2} µ^T Σ^{-1} µ + \frac{1}{2} \log\det(2\pi\Sigma). The Jensen α-divergence on the natural parameters is defined as

J_{\alpha}(\theta_1, \theta_2) = \alpha F(\theta_1) + (1 - \alpha) F(\theta_2) - F(\alpha \theta_1 + (1 - \alpha) \theta_2),    (18)

and the Bregman divergence on the natural parameters is defined as

B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle,    (19)

where F is the cumulant generating function in the definition of the exponential family.

In the exponential family of distributions, we have the following relationships:
• the Chernoff α-divergence between members of the same exponential family is equivalent to a Jensen α-divergence on the corresponding natural parameters;
• the maximal Jensen α-divergence can be computed as an equivalent Bregman divergence;
• the Bregman divergence is the same as the Kullback-Leibler divergence; the former is a distance acting on the natural parameters, while the latter is a distance on the distributions.

These relationships help us avoid the integration and optimization steps in the computation of
Chernoff information, allowing us instead to use a simple geodesic-bisection algorithm to find the optimal α*.
The algorithm is summarized in pseudocode in Algorithm 1.
Input: natural parameters θ1 and θ2 of two distributions belonging to the same exponential family.
Output: α* and θ* = (1 − α*)θ1 + α*θ2.
Initialize αm = 0, αM = 1.
while αM − αm > tol do
    Compute the midpoint α0 = (αm + αM)/2 and θ0 = (1 − α0)θ1 + α0θ2.
    if B(θ1 : θ0) < B(θ2 : θ0) then
        αm = α0;
    else
        αM = α0;
    end
end
Algorithm 1: A geodesic-bisection algorithm for computing the Chernoff information.
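For concreteness, the following is a small Python/NumPy sketch of Algorithm 1 specialized to multivariate Gaussian clusters, the case used throughout this paper. It exploits the fact that B(θ1 : θα) equals the KL divergence from the geodesic point to f_1, so the Bregman comparisons can be carried out with the closed-form Gaussian KL divergence; function names and the tolerance are illustrative choices, not part of the original algorithm statement.

import numpy as np

def gauss_kl(m0, S0, m1, S1):
    # KL( N(m0, S0) || N(m1, S1) ) in closed form.
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def chernoff_info_gauss(m1, S1, m2, S2, tol=1e-6):
    # Geodesic bisection on the natural-parameter segment between two Gaussians.
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)

    def geodesic_point(a):
        # Natural-parameter interpolation: precision matrices and
        # precision-weighted means are mixed linearly.
        P = (1 - a) * P1 + a * P2
        S = np.linalg.inv(P)
        m = S @ ((1 - a) * P1 @ m1 + a * P2 @ m2)
        return m, S

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        a = 0.5 * (lo + hi)
        m, S = geodesic_point(a)
        # At the optimum, the KL divergences from the geodesic point
        # to the two endpoint distributions are equal.
        if gauss_kl(m, S, m1, S1) < gauss_kl(m, S, m2, S2):
            lo = a   # still too close to f1, move toward f2
        else:
            hi = a
    a = 0.5 * (lo + hi)
    m, S = geodesic_point(a)
    return gauss_kl(m, S, m1, S1), a   # (Chernoff information, optimal mixing weight)

The returned value is the Chernoff information C*, and the returned weight is the optimal mixing parameter along the natural-parameter segment.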
For a size-n sample, the Chernoff information asymptotically upper-bounds the misclassification
error [40], i.e.,

\lim_{n \to \infty} E = e^{-n C(f_1, f_2)}.    (20)

The error also has an arbitrarily tight lower bound given by an information-theoretic distance [41],

\lim_{n \to \infty} E = e^{-n C(f_1, f_2)} \geq e^{-n R(f_1, f_2)},    (21)

where R(f_1, f_2) is the resistor-average distance (RAD) [21] defined as

R(f_1, f_2) = \left( KL(f_1, f_2)^{-1} + KL(f_2, f_1)^{-1} \right)^{-1},    (22)

and KL(f_1, f_2) is the Kullback-Leibler (KL) distance from f_1 to f_2. As stated before, the
exponential-family assumption allows computing the KL distance via the Bregman divergence on the
natural parameters, from which we can derive RADs between pairs of distributions and, hence,
a lower bound on the misclassification error.
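Continuing the Gaussian sketch above, the RAD of Eq. (22) follows directly from the two directed KL divergences (an illustrative helper that reuses the gauss_kl function defined after Algorithm 1):

def resistor_average_gauss(m1, S1, m2, S2):
    # Parallel-resistor combination of the two directed KL distances, Eq. (22).
    kl_12 = gauss_kl(m1, S1, m2, S2)
    kl_21 = gauss_kl(m2, S2, m1, S1)
    return 1.0 / (1.0 / kl_12 + 1.0 / kl_21)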
III. CHERNOFF INFORMATION AND HIGH-DIMENSIONAL CLUSTERING
The multivariate normal (MVN) distribution is the most popular distribution assumed in clustering HSI and LiDAR data, e.g., the matched filter [2], [3], the orthogonal subspace projection
[42], the adaptive matched filter (AMF) [43], and the adaptive cosine estimator (ACE) [44]. Hence,
in this section we focus on this particular member of the exponential family, even though
Algorithm 1 can be applied to any member in this family.
Let f_1 ∼ N(µ_1, Σ_1) and f_2 ∼ N(µ_2, Σ_2) represent two d-variate normal distributions. Then
the Chernoff α-divergence between these two distributions is

C_{\alpha}(f_1, f_2) = \frac{1}{2} \log \frac{|\alpha \Sigma_1 + (1-\alpha) \Sigma_2|}{|\Sigma_1|^{\alpha} |\Sigma_2|^{1-\alpha}} + \frac{\alpha(1-\alpha)}{2} (\mu_1 - \mu_2)^T [\alpha \Sigma_1 + (1-\alpha) \Sigma_2]^{-1} (\mu_1 - \mu_2).    (23)
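Equation (23) can also be transcribed directly; maximizing it over α on a grid (an illustrative shortcut rather than the geodesic bisection of Algorithm 1) recovers the same Chernoff information value and serves as a cross-check of the sketch above:

def chernoff_alpha_gauss(alpha, m1, S1, m2, S2):
    # Closed-form Chernoff alpha-divergence between two Gaussians, Eq. (23).
    Sa = alpha * S1 + (1 - alpha) * S2
    diff = m1 - m2
    log_det = 0.5 * (np.linalg.slogdet(Sa)[1]
                     - alpha * np.linalg.slogdet(S1)[1]
                     - (1 - alpha) * np.linalg.slogdet(S2)[1])
    quad = 0.5 * alpha * (1 - alpha) * diff @ np.linalg.solve(Sa, diff)
    return log_det + quad

# Chernoff information as a grid maximum over alpha:
# ci = max(chernoff_alpha_gauss(a, m1, S1, m2, S2) for a in np.linspace(0.01, 0.99, 99))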
Now we add an extra dimension to both f_1 and f_2, and denote \tilde{f}_1 ∼ N(\tilde{µ}_1, \tilde{Σ}_1) and \tilde{f}_2 ∼ N(\tilde{µ}_2, \tilde{Σ}_2)
as two (d+1)-dimensional normal distributions, where \tilde{µ}_1 = (µ_1; m_1), \tilde{µ}_2 = (µ_2; m_2), and

\tilde{\Sigma}_1 = \begin{pmatrix} \Sigma_1 & s_1 \\ s_1^T & \sigma_1^2 \end{pmatrix}, \qquad \tilde{\Sigma}_2 = \begin{pmatrix} \Sigma_2 & s_2 \\ s_2^T & \sigma_2^2 \end{pmatrix}.    (24)
Proposition 2 tells us that C_α(\tilde{f}_1, \tilde{f}_2) ≥ C_α(f_1, f_2), and hence C^*(\tilde{f}_1, \tilde{f}_2) ≥ C^*(f_1, f_2). To validate this
empirically, we select a small HSI dataset of size 60×100×58, with the first two numbers being
sizes of two spatial dimensions and the last number being the size of the spectral dimension.
For the details on the dataset, please see Sec. V. We transform the dataset into a matrix of size
6, 000 × 58, and incrementally increase the number of bands, d, from 1 to 58 following the
wavelength order. At each d, we feed the data to the k-means clustering algorithm, and then
use the resulting centroids and pixel memberships to calculate the cluster-conditional means
and covariance matrices, from which we can compute CIs between pairs of cluster-conditional
distributions. We then take the minimum CI over all pairs, which represents the largest
possible misclassification error, and plot it against the number of bands used in clustering, as shown
in Fig. 2. Here we set the number of clusters, k, to 5. From Fig. 2, we observe a general increasing
trend, but also ups and downs in the middle, which clearly violates Proposition 2.

Fig. 2. Minimum Chernoff information between pairs of clusters by k-means versus data dimension.
Investigating this problem, we find that the empirical covariances of the subspaces change
when a new band is added, i.e., \tilde{Σ}_{1:d,1:d} ≠ Σ, so the assumption in Eq. (24) is violated. Here,
\tilde{Σ}_{1:d,1:d} denotes the first d rows and first d columns of \tilde{Σ}. The only possible reason for this is
that the cluster memberships of some pixels switched after a new band was added. Figure 3 shows
the number of pixels that switched their class membership after a new band is added. In the worst
case, close to 30% of the pixels change membership, as shown by the peak. Clearly,
though correct classification could be done in lower dimensions, adding more data can potentially
alter these classifications, leading to wrong conclusions, and the membership-switching problem
presents a challenge in clustering high-dimensional data that is worth further investigation.
As claimed by Proposition 1, adding an independent variable would not decrease Chernoff
information, and because PCA would provide us with the space spanned by orthogonal principal
components, namely the principal component (PC) space, we can test the trend of CI after
incrementally adding PCs.

Fig. 3. The number of pixels that switched membership when a new dimension is added.

Another concern in high-dimensional data clustering is that the data variances at different bands
can also alter clustering results; see, e.g., [45], [46]. For example,
adding a large-variance variable could easily alter sample memberships as illustrated in Fig. 4,
where we see that point A, in the one-dimensional case on the horizontal axis, would be classified
as the red class, but in the two-dimensional case, it would be classified as the blue class. The
two centroids, C_1 and C_2, are more separated on the vertical axis, indicating a larger overall
variance on this second dimension. For this reason, it might be advisable to order features by their
variances, which is also tested here. Similarly, one could argue that normalizing each band by
its variance could help avoid the membership switching. All three cases are tested and shown in
Figure 5. The PCA approach clearly enjoys a monotonically increasing trend as PCs are incrementally
added. Ordering by variance also results in a monotonically increasing trend, although
it rises more slowly than the CI sequence by PCA in the beginning, and we observe a sharp jump
at the 35th ordered variable. Disappointingly, normalizing the bands does not help in removing the
ups and downs, as seen in the black CI curve.
IV. FEATURE SELECTION WITH CHERNOFF INFORMATION
Though PCA enjoys a clear advantage given the orthogonality of the PC space,
researchers often prefer to classify in the original feature space, which is
more physically or chemically meaningful. For example, it is hard to physically interpret PC
vectors, which must contain negative entries, in the HSI or LiDAR data.

Fig. 4. An example of membership switching after adding an extra dimension.

Fig. 5. Comparison of the minimum CIs with different orderings of the data (PCA, original wavelength order, ordered by variances, normalized).

In the original feature
space, besides ranking by variances, we can also rank them by various importance measures,
as suggested in the introduction section, e.g., the excess-mass measure [23], [24], [25] or the
BCSS used in k-means. For example, given a HSI dataset, rather than following the wavelength
order, we can rank and choose the most important bands before applying clustering algorithms.
Here, we introduce a hill-climbing scheme that incrementally searches for and adds the
most information-rich features, while guaranteeing no loss of information in terms of
classification. The end result would be a set of chosen features, which can then be reordered by,
for example, the wavelength order for algorithms such as spectral unmixing.
Starting from an empty set \hat{X}, at each iteration the algorithm searches for the variable X_i that
provides the largest minimum CI between all pairs of clusters, i.e.,

\hat{i} = \arg\max_i \min_{j,k,\, j \neq k} CI(\hat{f}_j, \hat{f}_k),    (25)

where \hat{f}_k is the empirical distribution of the k-th cluster estimated from the data \hat{X} concatenated
with X_i. The algorithm then adds X_{\hat{i}} to \hat{X}. Denote the maximum CI at the j-th iteration as c_j and
the sample membership set as M_j. To guarantee a nondecreasing sequence of CIs, if c_j < c_{j-1},
we use Proposition 2 by fixing the memberships to those of the last iteration, i.e., M_j = M_{j-1}, and
recomputing the CIs between all pairs before finding \hat{i}. The pseudocode is given in Algorithm 2.
Compared to the excess-mass measure or the BCSS, Chernoff information is more flexible
with model assumptions, because it allows a wide variety of exponential-family distributions,
while the excess-mass measure usually requires continuous data and the BCSS assumes
homoscedasticity, i.e., equal variances. It also helps in choosing the number of clusters, since
the minimum CI between all pairs of clusters would dwindle with an increasing number of
clusters. The fixing-membership idea also ensures no loss of information or features, especially
when noisy or highly correlated data is added that could potentially alter the correct classification.
V. EXPERIMENTS

The HSI and LiDAR datasets were simultaneously acquired by sensors on the same aircraft
over Gulfport, Mississippi [9] in 2010. The ground sampling distance is 1 meter.
Input: d-dimensional data X.
Output: Reordered d-dimensional data \hat{X}, and a CI sequence.
1  Initialize an index set S = {1, 2, ..., d}.
2  \hat{X} = ∅, j = 1.
3  while S ≠ ∅ do
4      for i ∈ S do
5          X_i' = concatenate(\hat{X}, X_i).
6          Cluster X_i' using, e.g., k-means, and assign the memberships to M_j.
7          Estimate Σ_k and µ_k for each cluster.
8          Compute CI_{ijk} between all pairs of clusters.
9      end
10     (î, ĵ, k̂) = arg max_i min_{j,k, j≠k} CI_{ijk}.
11     c_j = CI_{î ĵ k̂}.
12     if c_j < c_{j−1} then
13         M_j = M_{j−1}.
14         Repeat steps 4 to 11.
15     end
16     \hat{X} = concatenate(\hat{X}, X_î).
17     S = S \ {î}.
18     j = j + 1.
19 end
Algorithm 2: A hill-climbing algorithm for reordering features.
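To make the procedure concrete, the following is a compact Python sketch of the hill-climbing loop, built on scikit-learn's k-means and the chernoff_info_gauss routine sketched after Algorithm 1. It is a simplified illustration: the membership-fixing branch of steps 12-14 is applied only to the selected candidate rather than re-running the full inner loop, and all names and defaults are illustrative.

from itertools import combinations
from sklearn.cluster import KMeans

def min_pairwise_ci(X, labels, n_clusters):
    # Smallest Chernoff information over all pairs of cluster-conditional Gaussians.
    stats = [(X[labels == c].mean(axis=0),
              np.atleast_2d(np.cov(X[labels == c], rowvar=False)))
             for c in range(n_clusters)]
    return min(chernoff_info_gauss(m1, S1, m2, S2)[0]
               for (m1, S1), (m2, S2) in combinations(stats, 2))

def hill_climb_order(X, n_clusters=5):
    # X: (n_pixels, n_features) array; returns feature indices in hill-climbing order
    # together with the corresponding CI sequence.
    remaining = list(range(X.shape[1]))
    chosen, ci_seq = [], []
    prev_labels, prev_ci = None, -np.inf
    while remaining:
        best = None
        for i in remaining:                      # step 4: try every remaining feature
            Xi = X[:, chosen + [i]]
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Xi)
            ci = min_pairwise_ci(Xi, labels, n_clusters)
            if best is None or ci > best[1]:
                best = (i, ci, labels)
        i, ci, labels = best                     # step 10: largest minimum pairwise CI
        if ci < prev_ci and prev_labels is not None:
            # steps 12-14 (simplified): fix memberships to the previous iteration
            ci = min_pairwise_ci(X[:, chosen + [i]], prev_labels, n_clusters)
            labels = prev_labels
        chosen.append(i)
        remaining.remove(i)
        ci_seq.append(ci)
        prev_ci, prev_labels = ci, labels
    return chosen, ci_seq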
The original HSI dataset has 72 bands ranging from 0.4 µm to 1.0 µm. After removing noisy bands, we use
58 bands on a 60 × 100 scene, and the coregistered LiDAR point cloud is preprocessed as the
implicit geometry (IG) functions, including the population function, the validity function and the
distance function, each containing 12 features. For details of IG functions, see [47]. The scene,
as shown in Fig. 6, contains parking lots, buildings, vehicles, grass and trees.
Fig. 6. The top figure shows a false-color image of an HSI image at 0.72 µm. The middle figure shows the height profile of the scene from LiDAR. The bottom figure shows a 3D aerial view of the scene by Google Map (Imagery@2014 Google, Map data@2014 Google).

The fusion of HSI and LiDAR data considered next is at the feature level, meaning we simply
concatenate the two datasets together, for a total of 70 features. The purposes of the following
experiments are: first, to determine whether concatenating HSI data with LiDAR increases the
classification accuracy; second, to compare the hill-climbing approach with the three other ordering
schemes, namely the natural wavelength order, the PC order, and the variance order; and third, to
determine which of the three IG functions performs best.
The experiments are set up in such a way that for each combination of HSI and LiDAR IG,
i.e., HSI only, HSI + Validity, HSI + Distance, and HSI + Population, we will compute the CI
sequences of four different orders: the wavelength order, the PC order, the variance order and the
hill-climbing order. We use the k-means clustering algorithm and set the number of clusters at 5.
We derive the empirical cluster-conditional means and covariances from the pixel memberships
given by k-means, before transforming them into natural parameters and computing the statistical
distances.
Each subplot in Fig. 7 shows four curves of CIs, which correspond to the four different combinations of HSI and LiDAR features. From the ups and downs in all four HSI-only curves,
we can see that adding features following the wavelength order could reduce the CIs. Adding
LiDAR features does increase the CIs, or reduce the misclassification errors. The PC order mostly
enjoys an increasing sequence of CIs as more PCs are added, though we occasionally observe
small fluctuations, possibly due to the sensitivity of the k-means algorithm to initial conditions.
Ordering features by their variances does not guarantee nondecreasing sequences of CIs, as
we observe some large drops in the green curves. Finally, the hill-climbing approach not only
enjoys nondecreasing sequences of CIs, it also has the largest CIs in the end when all features
are included. The largest increase in CI is observed for HSI combined with the Distance function.
With the same experimental setup, we also computed the RADs, shown in Fig. 8, whose exponential provides lower bounds on the misclassification error. As expected, they have
greater values than the CIs, and mostly follow the same trends, though in the hill-climbing approach
they can have occasional drops. Note that the hill-climbing algorithm can also be based on the
RADs, rather than the CIs, which would provide a nondecreasing sequence of RADs. However, we
cannot guarantee nondecreasing sequences for both the RAD and the CI. For misclassification errors, we
transform the CIs and RADs into the lower and upper bounds on E, via e^{-R(f_1,f_2)} \leq E \leq e^{-C(f_1,f_2)},
and plot them in Fig. 9. More directly, we can see the combination of HSI with the Distance
function has the quickest drop in error as more features are added, and for a given tolerance
level this combination would need the smallest number of features. For example, with only 10
features, HSI + Distance would guarantee a misclassification error below 0.01.

Fig. 7. The CIs of increasing numbers of features in four different orders (natural wavelength, PC, variance, and hill-climbing), for HSI, HSI + Validity, HSI + Distance, and HSI + Population.

Fig. 8. The RADs of increasing numbers of features in four different orders.

Fig. 9. The misclassification error bounds (RAD and CI) of features selected by the hill-climbing algorithm, for HSI, HSI + Validity, HSI + Distance, and HSI + Population.
VI. CONCLUSIONS AND DISCUSSIONS
In this study we have introduced both upper and lower bounds on the misclassification error
based on information-theoretic statistical measures, which help us identify and collect features
for the best classification accuracy, without reliance on any particular classification algorithm.
Recent studies of Chernoff information allow us to compute it quickly with a geodesic-bisection
algorithm for the exponential family of distributions, which is widely assumed in analyzing
hyperspectral and LiDAR data. We have shown that, theoretically, adding more features should
not reduce the Chernoff information, or the classification accuracy, but empirically on sample data,
adding more features does not guarantee that Chernoff information will not be lost. Facing this
problem, we have devised a hill-climbing algorithm to reorder the features based on their
contribution to the Chernoff information, and hence to guarantee a nondecreasing sequence of CIs as
more features are added. The proposed approach does not remove any features, contrary to band-selection methods, while it is also flexible in the choice of underlying distributions for the clustering
models.
We have learned several things from this study. First, there are not necessarily "bad bands" or
"good bands" in all hyperspectral images, which often reminds us of data quality measures such
as the signal-to-noise ratio (SNR). For example, in Fig. 4, the data on both the x-axis and the y-axis appear to show a fairly good two-cluster pattern, yet adding the second dimension can still cause the membership
switching of certain points. Only in the more extreme cases, when the SNRs of certain bands
are so low or the data are so noisy that they are almost random, we should consider calling
these "bad bands" and removing them.
Second, because of the dependence of the CI on the assumed statistical model, the
determination of information-rich bands is model-dependent, meaning that one “good" band for
a certain model might be “bad" for another model. Hence, analysts might be encouraged to
have a good understanding of underlying statistical-model assumptions before applying certain
band-selection methods.
Third, as the model complexity increases, e.g., when the number of clusters in k-means
increases, the membership switching can be more severe and hence we might expect more ups and downs in the CIs after incrementally adding features; see, e.g., a comparison of the CIs of the
3-cluster model with the CIs of the 5-cluster model in Fig. 10. In these cases, we need to be
more careful in choosing the features to use.

Fig. 10. Comparison of the CIs of the 3-cluster model and the CIs of the 5-cluster model.

Also, the membership switching is not necessarily
unique to the more rudimentary clustering algorithms, such as k-means; as clusters become
less separated and the number of clusters increases, we can still observe similar
behaviors with more sophisticated algorithms.
Fourth, it might not be a good practice to incrementally add spectral bands following the
natural wavelength order as shown in Fig. 2. In our experiments, the wavelength order appears
to show poor CI sequences. We suggest that analysts consider using feature-selection techniques,
the CI approach being one of them, to rank and reorder the bands before applying their data
analysis methods. However, for physical interpretation of clustered data, the information-rich
bands selected by the hill-climbing algorithm should be reordered back into the wavelength order.
For example, the indices of the top five spectral bands ranked by the hill-climbing algorithm could
be {30, 27, 3, 50, 10}, and analysts should reorder them as {3, 10, 27, 30, 50} before applying
clustering algorithms for better interpretation of clustering results.
ACKNOWLEDGMENT
This work is sponsored by the National Geospatial-Intelligence Agency (NGA) under contract HM1582-10-C-0011 and by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-11-1-0194.
R EFERENCES
[1] A. P. Cracknell and L. W. Hayes, Introduction to remote sensing, 2 ed. London: Taylor and Francis, 2007.
[2] A. D. Stocker, I. S. Reed, and X. Yu, “Multidimensional signal processing for electro-optical target detection,” in
OE/LASE’90, 14-19 Jan., Los Angeles, CA.
International Society for Optics and Photonics, 1990, pp. 218–231.
[3] C. Funk, J. Theiler, D. Roberts, and C. Borel, “Clustering to improve matched filter detection of weak gas plumes in
hyperspectral thermal imagery,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 39, no. 7, pp. 1410–1420,
2001.
[4] I. Jolliffe, Principal component analysis.
Wiley Online Library, 2005.
[5] V. Pauca, J. Piper, and R. Plemmons, “Nonnegative matrix factorization for spectral data analysis,” Linear Algebra and its
Applications, vol. 416, no. 1, pp. 29–47, 2006.
[6] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative
matrix factorization,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 45, no. 3, pp. 765–777, 2007.
[7] X. Liu, W. Xia, B. Wang, and L. Zhang, “An Approach Based on Constrained Nonnegative Matrix Factorization to Unmix
Hyperspectral Data,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 49, no. 2, pp. 757–772, 2011.
[8] T. Tu, “Unsupervised signature extraction and separation in hyperspectral images: A noise-adjusted fast independent
component analysis approach,” Optical Engineering, vol. 39, p. 897, 2000.
[9] J. Wang and C. Chang, “Applications of independent component analysis in endmember extraction and abundance
quantification for hyperspectral imagery,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 44, no. 9, pp.
2601–2616, 2006.
[10] J. Gualtieri and R. Cromp, “Support vector machines for hyperspectral remote sensing classification,” in Proceedings of
SPIE- The International Society for Optical Engineering, vol. 3584.
Citeseer, 1999, pp. 221–232.
[11] S. Chiang, C. Chang, and I. Ginsberg, “Unsupervised target detection in hyperspectral images using projection pursuit,”
Geoscience and Remote Sensing, IEEE Transactions on, vol. 39, no. 7, pp. 1380–1391, 2002.
[12] J. Tilton and W. Lawrence, “Interactive analysis of hierarchical image segmentation,” in Geoscience and Remote Sensing
Symposium, 2000. Proceedings. IGARSS 2000. IEEE 2000 International, vol. 2.
IEEE, 2002, pp. 733–735.
[13] G. P. Asner, D. E. Knapp, T. Kennedy-Bowdoin, M. O. Jones, R. E. Martin, J. Boardman, and C. B. Field, “Carnegie airborne
observatory: in-flight fusion of hyperspectral imaging and waveform light detection and ranging for three-dimensional
studies of ecosystems,” Journal of Applied Remote Sensing, vol. 1, no. 1, pp. 013 536–013 536, 2007.
[14] P. Gader, A. Zare, R. Close, and G. Tuell, “Co-registered hyperspectral and lidar long beach, mississippi data collection,”
University of Florida, University of Missouri, and Optech International, 2010.
[15] C. Chang, Hyperspectral data exploitation: theory and applications.
Wiley-Interscience, 2007.
[16] M. Dalponte, L. Bruzzone, and D. Gianelle, “Fusion of hyperspectral and lidar remote sensing data for classification of
complex forest areas,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 46, no. 5, pp. 1416–1427, 2008.
[17] M. T. Eismann, Hyperspectral Remote Sensing.
SPIE Press, 2012.
[18] J. Schott, Remote sensing: the image chain approach.
Oxford University Press, USA, 2007.
[19] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is "nearest neighbor" meaningful?” in Int. Conf. on
Database Theory, 1999, pp. 217–235.
[20] F. Nielsen, “Chernoff information of exponential families,” arXiv preprint arXiv:1102.2684, 2011.
[21] D. H. Johnson and S. Sinanovic, “Symmetrizing the kullback-leibler distance,” IEEE Transactions on Information Theory,
vol. 1, no. 1, pp. 1–10, 2001.
[22] J. A. Benediktsson, J. R. Sveinsson, and K. Amason, “Classification and feature extraction of aviris data,” Geoscience and
Remote Sensing, IEEE Transactions on, vol. 33, no. 5, pp. 1194–1205, 1995.
[23] D. W. Müller and G. Sawitzki, “Excess mass estimates and tests for multimodality,” Journal of the American Statistical
Association, vol. 86, no. 415, pp. 738–746, 1991.
[24] D. Nolan, “The excess-mass ellipsoid,” Journal of Multivariate Analysis, vol. 39, no. 2, pp. 348–371, 1991.
[25] Y.-b. Chan and P. Hall, “Using evidence of mixed populations to select variables for clustering very high-dimensional
data,” Journal of the American Statistical Association, vol. 105, no. 490, 2010.
[26] D. M. Witten and R. Tibshirani, “A framework for feature selection in clustering,” Journal of the American Statistical
Association, vol. 105, no. 490, 2010.
[27] A. E. Raftery and N. Dean, “Variable selection for model-based clustering,” Journal of the American Statistical Association,
vol. 101, no. 473, pp. 168–178, 2006.
[28] A. Martinez-Uso, F. Pla, J. Sotoca, and P. Garcia-Sevilla, “Clustering-based hyperspectral band selection using information
measures,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 45, no. 12, pp. 4158–4171, 2007.
[29] J. M. Sotoca and F. Pla, “Hyperspectral data selection from mutual information between image bands,” in Structural, Syntactic, and Statistical Pattern Recognition. Springer, 2006, pp. 853–861.
[30] P. Bajcsy and P. Groves, “Methodology for hyperspectral band selection,” Photogrammetric engineering and remote sensing,
vol. 70, pp. 793–802, 2004.
[31] J. C. Russ, The image processing handbook.
CRC press, 2006.
[32] J. C. Price, “Band selection procedure for multispectral scanners,” Applied Optics, vol. 33, no. 15, pp. 3281–3288, 1994.
[33] J. B. Campbell, Introduction to remote sensing.
Taylor & Francis, 2002.
[34] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification.
John Wiley & Sons, 2012.
[35] J.-F. Chamberland and V. V. Veeravalli, “Decentralized detection in sensor networks,” Signal Processing, IEEE Transactions
on, vol. 51, no. 2, pp. 407–416, 2003.
[36] F. Calderero and F. Marques, “Region merging techniques using information theory statistical measures,” Image Processing,
IEEE Transactions on, vol. 19, no. 6, pp. 1567–1586, 2010.
[37] F. A. Sadjadi, “Performance evaluations of correlations of digital images using different separability measures,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, no. 4, pp. 436–441, 1982.
[38] S. K. Zhou and R. Chellappa, “Beyond one still image: Face recognition from multiple still images or a video sequence,”
in Face Processing: Advanced Modeling and Methods.
Citeseer, 2005, pp. 547–567.
[39] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu, “Statistical edge detection: Learning and evaluating edge cues,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 1, pp. 57–74, 2003.
[40] T. M. Cover and J. A. Thomas, Elements of information theory.
John Wiley & Sons, 2012.
[41] H. Avi-Itzhak and T. Diep, “Arbitrarily tight upper and lower bounds on the bayesian probability of error,” Pattern Analysis
and Machine Intelligence, IEEE Transactions on, vol. 18, no. 1, pp. 89–91, 1996.
[42] J. C. Harsanyi and C.-I. Chang, “Hyperspectral image classification and dimensionality reduction: an orthogonal subspace
projection approach,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 32, no. 4, pp. 779–785, 1994.
[43] D. G. Manolakis, G. A. Shaw, and N. Keshava, “Comparative analysis of hyperspectral adaptive matched filter detectors,”
in PROCEEDINGS-SPIE THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING.
International Society for
Optical Engineering; 1999, 2000, pp. 2–17.
[44] D. Manolakis and G. Shaw, “Detection algorithms for hyperspectral imaging applications,” Signal Processing Magazine,
IEEE, vol. 19, no. 1, pp. 29–43, 2002.
[45] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” in ACM
SIGMOD Record, vol. 28, no. 2.
ACM, 1999, pp. 61–72.
[46] C. Bohm, K. Railing, H.-P. Kriegel, and P. Kroger, “Density connected clustering with local subspace preferences,” in
Data Mining, 2004. ICDM’04. Fourth IEEE International Conference on.
IEEE, 2004, pp. 27–34.
[47] D. Nikic, J. Wu, V. Pauca, R. Plemmons, and Q. Zhang, “A novel approach to environment reconstruction in lidar and hsi
datasets,” in Advanced Maui Optical and Space Surveillance Technologies Conference, 2012.