CLASSIFICATION OF PIXEL-LEVEL FUSED HYPERSPECTRAL AND LIDAR DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS

Saurabh Morchhale, V. Paúl Pauca, Robert J. Plemmons, and Todd C. Torgersen
Wake Forest University, Departments of Computer Science and Mathematics, Winston-Salem, NC 27109

ABSTRACT

We investigate classification from pixel-level fusion of hyperspectral (HSI) and Light Detection and Ranging (LiDAR) data using convolutional neural networks (CNN). HSI and LiDAR imaging are complementary modalities increasingly used together for geospatial data collection in remote sensing. HSI data are used to glean information about material composition, while LiDAR data provide information about the geometry of objects in the scene. Two key questions relative to classification performance are addressed: the effect of merging multi-modal data and the effect of uncertainty in the CNN training data. Two recent co-registered HSI and LiDAR datasets are used here to characterize performance. One was collected over Houston, TX, by the University of Houston National Center for Airborne Laser Mapping with NSF sponsorship; the other was collected over Gulfport, MS, by the Universities of Florida and Missouri with NGA sponsorship.

Index Terms— LiDAR and Hyperspectral Imaging, Convolutional Neural Networks, Data Fusion.

1. INTRODUCTION

Hyperspectral imaging (HSI) combines the power of digital imaging and spectroscopy. Imaging spectrometers gather data over a wide and continuous band of the electromagnetic spectrum, which can be used to accurately determine the composition of objects and ground cover in a scene. When the images are acquired at high spatial resolution and co-registered, the resulting data provide a robust and detailed characterization of the earth's surface and its constituent elements.

Light Detection and Ranging (LiDAR) is a remote sensing method that uses light in the form of a pulsed laser to measure ranges (variable distances) to objects in a scene. These light pulses, combined with other data recorded by the airborne system, generate precise three-dimensional information about the elevation and shape of the Earth's surface.

In recent years, it has been shown that remote sensing tasks, such as scene reconstruction, feature enhancement, and classification of targets, are improved when co-registered HSI and LiDAR data are used jointly. This has spurred active research into methods that can reliably fuse and extract information from these complementary sensing modalities. A number of feature-level fusion techniques have been developed that combine features extracted individually from each modality to produce a new feature set that better represents the scene [1, 2, 3, 4]. For example, Dalponte, Bruzzone, and Gianelle [5] apply the sequential forward floating selection method to extract features from denoised hyperspectral data. These features are then integrated with corrected elevation and intensity models derived from the LiDAR data and classified using support vector machines and Gaussian maximum likelihood techniques. As part of the winning entry for the 2013 GRSS Data Fusion Contest, Debes et al. [1] combined abundance maps obtained through a spectral unmixing procedure with LiDAR data, providing topographic information to the classification process. A flexible strategy based on morphological features and subspace multinomial logistic regression was presented in [2] for jointly classifying HSI and LiDAR data without the need for regularization parameters.
More recently, the use of deep convolutional neural networks (CNN) has been proposed for classification of hyperspectral imagery [6, 7]. In deep learning, neural networks of three or more layers are used to learn deep features of the input data and can provide better approximations to nonlinear functions than single-layer classifiers, such as linear support vector machines. Inspired by visual mechanisms in living organisms, CNNs consist of layers of neurons whose outputs are combined through convolution. Some applications of CNNs include material classification, object detection, and face and speech recognition.

Here, we are concerned with the classification performance of CNNs when HSI and LiDAR data are combined at the pixel level [8], that is, before feature extraction in the classification process. In this type of fusion, LiDAR elevation data are replicated and appended to the HSI data for each pixel in the scene. The combined data are then processed with a multilayer CNN similar to that proposed in [7] to learn the filters producing the strongest response to local input patterns. Pixel-level fusion can have an advantage over other techniques in that it tends to avoid the loss of information that may occur during the feature extraction process [9]. This paper is motivated by the thesis work of Saurabh Morchhale [10]. We characterize classification performance by modifying the CNN parameters and investigate robustness to classification errors in the training data. We apply our techniques to sample classification problems using two well-known hyperspectral and LiDAR datasets that have been recently developed for test purposes.

2. CLASSIFICATION FRAMEWORK

2.1. Data Fusion

We assume that the LiDAR and HSI datasets are georeferenced and have been pre-processed to have the same spatial resolution, providing information for the same surface area over the Earth. Let the column vector h(x, y) ∈ R^{M_1} denote the spectral response over M_1 channels and d(x, y) ∈ R^{M_2} denote a column vector of components derived from the LiDAR data, such as elevation and LiDAR intensity, at each point (x, y) in a regularly spaced grid over the observed surface. Further, each component of d(x, y) is scaled to ensure balance among the fused data sources. We define a new data vector for point (x, y) as follows:

    g(x, y) = [ h(x, y)^T , (d(x, y) ⊗ 1)^T ]^T,    (1)

where ⊗ is the Kronecker product and 1 is a column vector of ones, so that d(x, y) ⊗ 1 is a repetition of the LiDAR components. The length of the vector 1 is an additional parameter that we include in the characterization of CNN performance for fused LiDAR and HSI data. In general, the repetition of the LiDAR data, relative to the length of the HSI vector, can be used as a form of weighting the desired influence of one modality over the other in the classification procedure.
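To make the fusion step of equation (1) concrete, the short NumPy sketch below assembles the fused vector g for a single pixel. It is only an illustration of the idea, not the authors' implementation; the function name, the random stand-in data, and the repetition factor of 16 (borrowed from Section 3) are our own choices.

import numpy as np

def fuse_pixel(h, d, reps):
    # Pixel-level fusion of equation (1): stack the HSI spectrum h (length M1)
    # on top of the LiDAR-derived components d (length M2), each repeated
    # `reps` times, i.e. the Kronecker product of d with a vector of ones.
    h = np.asarray(h, dtype=float)
    d = np.asarray(d, dtype=float)
    return np.concatenate([h, np.kron(d, np.ones(reps))])

# Illustrative only: a 144-band HSI pixel and one scaled DSM value
# repeated 16 times (about 11% of the HSI length, as in Section 3).
rng = np.random.default_rng(0)
h = rng.random(144)      # stand-in for a hyperspectral spectrum
d = np.array([0.37])     # stand-in for a scaled LiDAR elevation value
g = fuse_pixel(h, d, reps=16)
print(g.shape)           # -> (160,)

Changing `reps` changes the relative weight of the LiDAR components in the fused vector, which is the tunable influence discussed above.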
2.2. Convolutional Neural Networks

A convolutional neural network is a multilayer neural network inspired by the organization of neurons in the animal visual cortex [11]. It generally consists of one or more convolutional layers and intermediate subsampling layers, followed by at least one fully connected layer. Here, we consider the CNN architecture recently developed in [7] for processing HSI data. In our case, the input layer consists of the fused vectors g(x, y) from equation (1). For simplicity, we use a single index i to enumerate these vectors in the spatial domain, i.e., g_i = g(x, y) for i ∈ {1, 2, ..., N}, where N is the total number of spatial points (x, y) in the dataset.

The convolution layer consists of a set of K 1-D filter vectors, {f_k}, of fixed length. In this layer, each filter is convolved with g_i to produce u_{i,k} = tanh(g_i ∗ f_k), where ∗ denotes the convolution operation. In the max pooling layer, the u_{i,k} are subsampled by taking the maxima over non-overlapping regions of length 2, producing vectors u^s_{i,k} of half the size. Next, the subsampled vectors are stacked together as u^s_i = [u^s_{i,1}, u^s_{i,2}, ..., u^s_{i,K}]^T, which is then used as input to the hidden neuron layer, producing the output vector y_i. This process is expressed as

    y_i = f(W^{(h)} u^s_i + b^{(h)}),    (2)

where W^{(h)} is the weight matrix associated with the hidden neuron layer and b^{(h)} is its bias vector. The number of rows, P, in W^{(h)} corresponds to the number of neurons in the layer. The function f(·) is the layer's activation function, f(x) = tanh(x), applied element-wise to its argument in equation (2). Finally, y_i is passed through the output layer to produce

    t_i = exp(W^{(o)} y_i + b^{(o)}),    (3)
    z_i = t_i / ||t_i||_1,    (4)

i.e., the softmax function is applied to W^{(o)} y_i + b^{(o)}. Here, W^{(o)} is the weight matrix associated with the output layer, and b^{(o)} is its bias vector. The number of rows, C, in W^{(o)} corresponds to the number of labeled classes specified during the training phase of the CNN. The final output vector z_i contains the estimated class probabilities for the classification of input vector g_i.

2.3. Optimization of Class Probabilities

During the training phase of a CNN, an objective function measuring the classification error for the sample training data is minimized. Let {g_i}_{i=1}^{N_t} denote the set of N_t samples used for training and let {L_j}_{j=1}^{C} denote the set of C classification labels. Further, let x denote a vector of all the trainable parameters, specifically {f_k}, W^{(h)}, b^{(h)}, W^{(o)}, and b^{(o)}. Recall that z_i in equation (4) is a vector of length C; let z_{i,j} denote the j-th component of vector z_i. We define the following objective function:

    J(x) = -(1/N_t) Σ_{i=1}^{N_t} Σ_{j=1}^{C} δ_{i,j} log(z_{i,j}),    (5)

where δ_{i,j} = 1 if g_i belongs to class L_j and δ_{i,j} = 0 otherwise. Equation (5) is the well-known logarithmic loss function, which is used to maximize predictive accuracy by rewarding correct classifications made with high probability, i.e., whenever z_{i,j} is close to 1 and sample g_i ∈ L_j. We minimize (5) using a standard gradient descent approach. Starting with an initial guess x_0 for a local minimum of J(x), we compute

    x_{n+1} = x_n − α_n ∇J(x_n),    n ≥ 0,    (6)

where ∇J(x_n) is the gradient of J. In our implementation the step size α_n is kept constant at α_n = 0.08. We stop iterating when the relative change in the cost function is sufficiently small; this serves as a regularization constraint that tends to avoid overfitting.
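The NumPy sketch below traces the forward pass of equations (2)-(4) and the per-sample term of the loss in equation (5). It is a minimal illustration under assumed shapes, not the trained network used in the experiments: only K = 40, P = 60, and C = 15 come from the text, while the fused input length, filter length, and random initialization are placeholders of our own.

import numpy as np

def forward(g, filters, Wh, bh, Wo, bo):
    # Forward pass of the CNN sketched in Section 2.2 (illustrative only).
    # Convolution layer with tanh nonlinearity: u_{i,k} = tanh(g * f_k).
    u = np.stack([np.tanh(np.convolve(g, f, mode='valid')) for f in filters])
    # Max pooling over non-overlapping regions of length 2.
    u = u[:, : 2 * (u.shape[1] // 2)]              # drop a trailing sample if odd
    us = u.reshape(u.shape[0], -1, 2).max(axis=2)  # halves each filter response
    us = us.reshape(-1)                            # stack the K pooled vectors
    # Hidden layer, equation (2).
    y = np.tanh(Wh @ us + bh)
    # Output layer, equations (3)-(4): softmax over C classes.
    t = np.exp(Wo @ y + bo)
    return t / np.sum(t)

def log_loss(z, j):
    # Per-sample contribution to equation (5) when the true class index is j.
    return -np.log(z[j])

# Assumed shapes: fused input of length 160 and filter length 17 are ours;
# K = 40 filters, P = 60 neurons, and C = 15 classes come from Section 3.
rng = np.random.default_rng(0)
K, P, C, n, flen = 40, 60, 15, 160, 17
pooled_len = K * ((n - flen + 1) // 2)
filters = rng.normal(scale=0.1, size=(K, flen))
Wh, bh = rng.normal(scale=0.1, size=(P, pooled_len)), np.zeros(P)
Wo, bo = rng.normal(scale=0.1, size=(C, P)), np.zeros(C)
z = forward(rng.random(n), filters, Wh, bh, Wo, bo)
print(z.sum(), log_loss(z, 3))   # class probabilities sum to 1

In training, the gradient of the averaged loss with respect to all of these parameters would drive the constant-step descent of equation (6).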
3. EXPERIMENTAL RESULTS

We first employ the 2013 IEEE GRSS Data Fusion Contest dataset [12] to characterize classification performance. This dataset consists of a hyperspectral image of 144 spectral bands and a Digital Surface Model (DSM) derived from airborne LiDAR, at a 2.5 m spatial resolution. There are 1903 × 329 pixels in the scene, and true labels are known for only a small subset of these pixels. We utilized a total of 2832 known (labeled) pixels, spread over all C = 15 classes, to form our fused dataset {g_i}, with the LiDAR DSM value repeated 16 times, or 11% relative to the length of the HSI vector. Too small a recurrence of the LiDAR value yields no improvement, while too large a recurrence tends to decrease overall classification accuracy. We found that 11% to 14% LiDAR recurrence produced the best classification results.

The size of the training dataset relative to that of the test dataset is an important consideration of practical value. Too large a training dataset can lead to overfitting and is also unrealistic in most imaging applications. To avoid overfitting, we adopt the technique proposed in [13] and partition the fused dataset {g_i} into three subsets: training, validation, and testing. Specifically, we choose 900 observation vectors (60 samples per class) for training, 900 observation vectors for validation, and 1032 observation vectors for testing, to characterize the classification accuracy of our CNN approach. In addition, we use K = 40 convolution filters and P = 60 neurons in the fully connected layer.

3.1. Classification Accuracy

Figure 1 compares the classification accuracy obtained with HSI data alone and with fused HSI and LiDAR data. As can be observed, the accuracy of the CNN output is roughly 10% higher for the fused vectors relative to classification via the HSI data alone. Moreover, an accuracy of 80% is reached in 55 iterations (or epochs) compared to 160 for the HSI data. Table 1 shows the complete error matrix for all classes in the dataset. Notice that accuracies of 98% or higher are achieved for stressed grass, trees, soil, tennis court, and running track.

Fig. 1. Comparison of classification accuracy, HSI vs. fused HSI-LiDAR, for the 2013 IEEE GRSS DF Contest dataset.

Pixel-level fusion can introduce additional variability across HSI classes, as can be observed in Figure 2. Notice how adding the LiDAR component significantly increases the distinction between the commercial-building and highway traces and between the residential-building and parking lot 1 spectral traces. The complementary nature of HSI and LiDAR enables this gain in variability across HSI classes.

Fig. 2. Fused data vectors for pixels with similar hyperspectral response.

3.2. Error in the Training Data

In this experiment, we consider the possibility of misclassification error in the training data due to human or preprocessing oversight. To do this we randomly switch the labels for a percentage of the 900 data vectors g_i used for training of the CNN. We then classify the 1932 testing data vectors using the CNN trained in this way. Table 2 shows the effect of up to 20% misclassification error in the training data on the true-positive classification rates. Interestingly, the algorithm appears, in most cases, to be impervious to such error, except for classes with relatively low true-positive classification rates. Corresponding results are also shown in Table 2 for classification using HSI data only.

3.3. Overall Image Classification

A visual comparison of the classification of all the pixels of the 2013 IEEE GRSS DF Contest dataset is given in Figure 3. For these results, 100 known pixels per class were used for training of the CNN instead of 60 (a ratio of 1.13:1 between training and test datasets). This change in the size of the training data resulted in approximately 1% improvement relative to the error matrix results shown in Table 1.
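The misclassification-error experiment of Section 3.2 amounts to reassigning the labels of a chosen fraction of the training vectors before training. A minimal sketch of that corruption step is given below; the function and variable names are ours, and the stand-in labels are random, so this only illustrates the mechanism rather than reproducing the experiment.

import numpy as np

def corrupt_labels(labels, fraction, num_classes, rng):
    # Randomly switch the labels of a given fraction of the training samples,
    # assigning each affected sample a different, randomly chosen class.
    labels = labels.copy()
    n_flip = int(round(fraction * labels.size))
    idx = rng.choice(labels.size, size=n_flip, replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels

# Example: flip 10% of 900 training labels drawn from C = 15 classes.
rng = np.random.default_rng(0)
train_labels = rng.integers(0, 15, size=900)   # stand-in labels
noisy = corrupt_labels(train_labels, 0.10, 15, rng)
print((noisy != train_labels).sum())           # 90 labels changed

The CNN is then trained on the corrupted labels and evaluated on the untouched test vectors, which is what Table 2 summarizes for error levels from 1% to 20%.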
Table 1. Classification accuracies (HSI | Fused) for the 2013 IEEE GRSS DF Contest dataset. Each row corresponds to a true class; the entries give the percentage of that class's test samples assigned to each listed class, first for the HSI-only CNN and then for the fused HSI-LiDAR CNN. Classes not listed in a row received 0% | 0%.

Healthy grass: Healthy grass 96% | 95%; Stressed grass 4% | 5%.
Stressed grass: Stressed grass 100% | 100%.
Synthetic grass: Synthetic grass 98% | 95%; Highway 0% | 1%; Railway 1% | 3%; Parking Lot 1 0% | 1%; Parking Lot 2 1% | 0%.
Trees: Healthy grass 6% | 2%; Stressed grass 3% | 0%; Trees 91% | 98%.
Soil: Soil 98% | 98%; Commercial 1% | 0%; Road 0% | 1%; Highway 1% | 0%; Parking Lot 2 0% | 1%.
Water: Synthetic grass 1% | 0%; Water 91% | 93%; Residential 3% | 3%; Highway 2% | 0%; Parking Lot 1 1% | 2%; Parking Lot 2 2% | 2%.
Residential: Residential 85% | 90%; Commercial 1% | 4%; Railway 4% | 6%; Parking Lot 1 7% | 0%; Parking Lot 2 1% | 0%; Tennis Court 2% | 0%.
Commercial: Soil 2% | 0%; Residential 1% | 1%; Commercial 54% | 87%; Road 10% | 0%; Highway 2% | 0%; Railway 0% | 1%; Parking Lot 1 3% | 11%; Parking Lot 2 28% | 0%.
Road: Soil 0% | 1%; Commercial 2% | 0%; Road 52% | 82%; Highway 37% | 3%; Railway 5% | 7%; Parking Lot 1 3% | 2%; Parking Lot 2 1% | 4%; Tennis Court 0% | 1%.
Highway: Road 14% | 1%; Highway 62% | 85%; Railway 7% | 6%; Parking Lot 1 17% | 8%.
Railway: Residential 17% | 9%; Road 2% | 0%; Highway 3% | 3%; Railway 73% | 83%; Parking Lot 1 5% | 5%.
Parking Lot 1: Soil 6% | 0%; Residential 2% | 0%; Commercial 0% | 5%; Road 32% | 32%; Highway 15% | 3%; Railway 3% | 1%; Parking Lot 1 42% | 52%; Parking Lot 2 0% | 7%.
Parking Lot 2: Soil 2% | 3%; Residential 8% | 0%; Commercial 9% | 0%; Road 17% | 7%; Highway 5% | 2%; Railway 0% | 6%; Parking Lot 1 11% | 9%; Parking Lot 2 46% | 69%; Tennis Court 1% | 2%; Running Track 1% | 2%.
Tennis Court: Commercial 0% | 1%; Tennis Court 100% | 99%.
Running Track: Soil 0% | 2%; Running Track 100% | 98%.

Table 2. Classification accuracies (true-positive rate, HSI | Fused) for the 2013 IEEE GRSS DF Contest dataset on introduction of misclassification error into the training labels.

Class              No error      1%            5%            10%           15%           20%
Healthy grass      96% | 95%     98% | 95%     96% | 96%     97% | 96%     96% | 96%     93% | 97%
Stressed grass     100% | 100%   100% | 100%   100% | 100%   100% | 100%   100% | 100%   100% | 100%
Synthetic grass    98% | 95%     98% | 95%     86% | 96%     98% | 100%    98% | 100%    99% | 100%
Trees              91% | 98%     97% | 98%     82% | 99%     95% | 99%     95% | 97%     97% | 98%
Soil               98% | 98%     98% | 99%     98% | 100%    99% | 100%    100% | 100%   100% | 100%
Water              91% | 93%     92% | 92%     91% | 93%     93% | 97%     94% | 96%     93% | 95%
Residential        85% | 90%     81% | 85%     81% | 91%     71% | 93%     58% | 92%     70% | 91%
Commercial         54% | 87%     53% | 86%     53% | 86%     47% | 85%     54% | 91%     46% | 92%
Road               52% | 82%     59% | 85%     53% | 85%     68% | 84%     79% | 71%     56% | 65%
Highway            62% | 85%     60% | 89%     54% | 82%     53% | 85%     47% | 85%     50% | 79%
Railway            73% | 83%     79% | 83%     45% | 75%     69% | 78%     71% | 82%     69% | 75%
Parking Lot 1      42% | 52%     39% | 58%     50% | 44%     38% | 55%     37% | 48%     39% | 40%
Parking Lot 2      46% | 69%     53% | 59%     52% | 61%     49% | 52%     37% | 56%     69% | 59%
Tennis Court       100% | 99%    100% | 99%    100% | 99%    100% | 99%    100% | 99%    98% | 100%
Running Track      100% | 98%    100% | 98%    100% | 98%    100% | 98%    98% | 98%     100% | 98%

Fig. 4. MUUFL Gulfport dataset: CNN using HSI data only (left) and fused HSI and LiDAR data (middle). Google map view (right).
Fig. 3. 2013 IEEE GRSS DF Contest dataset: CNN using HSI data only (top) and fused HSI and LiDAR data (middle). Google map view (bottom).

3.4. MUUFL Gulfport Dataset

We also applied the algorithm to a subset of the MUUFL Gulfport dataset [14, 15], which consists of a hyperspectral image of 58 spectral bands and co-registered LiDAR elevation and intensity data. There are 320 × 360 pixels in this dataset. For training purposes we selected 60 pixels for each of 12 known classes and used an additional 1620 labeled pixels for testing. Repetition of the LiDAR elevation and intensity components was set to 5. We use K = 20 convolution filters and P = 20 neurons in the fully connected layer. Classes labeled Targets 1-4 correspond to materials placed on the ground. Classification maps for this dataset are shown in Figure 4.

4. DISCUSSION AND FUTURE WORK

Our experimental results suggest that pixel-level data fusion can be an effective approach to improving the accuracy of convolutional neural network classifiers, by roughly 10% compared to classification via HSI alone. Similar improvements due to fusion have been reported for non-CNN approaches [16]. Whether CNNs provide better results than non-CNN methods is a topic of future research, though our results seem to suggest that higher accuracies can be achieved.

Future work includes exploitation of the natural correlation among nearby pixels. Nearby pixels may be expected to be highly correlated in their spectral signatures, their LiDAR elevations, and their LiDAR intensities. We believe the results presented here can be extended to use 3-D convolution to further improve classification accuracy, and we plan further experiments with the MUUFL Gulfport dataset. We will also consider the question of whether the convolution steps of our CNN will tolerate some registration error between the two modalities.

Acknowledgements

The authors would like to thank the Hyperspectral Image Analysis group and the NSF-funded Center for Airborne Laser Mapping (NCALM) at the University of Houston for providing the datasets used in this study, and the IEEE GRSS Data Fusion Technical Committee for organizing the 2013 Data Fusion Contest. The authors also thank P. Gader, A. Zare, R. Close, J. Aitken, G. Tuell, the University of Florida, and the University of Missouri for sharing the "MUUFL Gulfport Hyperspectral and LiDAR Data Collection" acquired with NGA funding. In addition, the authors thank the reviewers for their helpful comments and suggestions. This research was supported in part by the U.S. Air Force Office of Scientific Research (AFOSR) under Grant no. FA9550-15-1-0286.

5. REFERENCES

[1] C. Debes, A. Merentitis, R. Heremans, J. Hahn, N. Frangiadakis, T. van Kasteren, W. Liao, R. Bellens, A. Pizurica, S. Gautama, et al., "Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS Data Fusion Contest," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2405–2418, 2014.

[2] M. Khodadadzadeh, J. Li, S. Prasad, and A. Plaza, "Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2971–2983, 2015.

[3] D. Nikic, J. Wu, V. P. Pauca, R. Plemmons, and Q. Zhang, "A novel approach to environment reconstruction in LiDAR and HSI datasets," in Advanced Maui Optical and Space Surveillance Technologies Conference, 2012, vol. 1, p. 81.

[4] Q. Zhang, V. P. Pauca, R. J. Plemmons, and D.
Nikic, "Detecting objects under shadows by fusion of hyperspectral and LiDAR data: A physical model approach," in Proc. 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2013, pp. 1–4.

[5] M. Dalponte, L. Bruzzone, and D. Gianelle, "Fusion of hyperspectral and LiDAR remote sensing data for classification of complex forest areas," IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 5, pp. 1416–1427, 2008.

[6] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, "Deep learning-based classification of hyperspectral data," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094–2107, 2014.

[7] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification," Journal of Sensors, vol. 2015, 2015.

[8] C. Pohl and J. L. Van Genderen, "Review article: Multisensor image fusion in remote sensing: Concepts, methods and applications," International Journal of Remote Sensing, vol. 19, no. 5, pp. 823–854, 1998.

[9] M. Mangolini, Apport de la fusion d'images satellitaires multicapteurs au niveau pixel en télédétection et photo-interprétation, Ph.D. thesis, Université de Nice Sophia-Antipolis, 1994.

[10] S. Morchhale, "Deep convolutional neural networks for classification of fused hyperspectral and LiDAR data," M.S. thesis, Dept. of Computer Science, Wake Forest University, 2016, http://csweb.cs.wfu.edu/~pauca/MorchhaleThesis16.pdf.

[11] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. Rodriguez-Sanchez, and L. Wiskott, "Deep hierarchies in the primate visual cortex: What can we learn for computer vision?," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1847–1871, 2013.

[12] IEEE Geoscience and Remote Sensing Society, "2013 Data Fusion Contest," http://www.grss-ieee.org/community/technical-committees/data-fusion.

[13] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, Springer Series in Statistics, Springer, Berlin, 2001.

[14] P. Gader, A. Zare, R. Close, and G. Tuell, "Co-registered hyperspectral and LiDAR Long Beach, Mississippi data collection," University of Florida, University of Missouri, and Optech International, 2010.

[15] P. Gader, A. Zare, R. Close, J. Aitken, and G. Tuell, "MUUFL Gulfport hyperspectral and LiDAR airborne data set," Tech. Rep. REP-2013-570, University of Florida, Oct. 2013.

[16] P. Hao and Z. Niu, "Comparison of different LiDAR and hyperspectral data fusion strategies using SVM and ABNet," Remote Sensing Science, vol. 1, no. 3, 2013.