Wavelet Spectral Dimension Reduction of Hyperspectral Imagery on a Reconfigurable Computer

Tarek El-Ghazawi(1), Esam El-Araby(1), Abhishek Agarwal(1), Jacqueline Le Moigne(2), and Kris Gaj(3)
(1) The George Washington University, (2) NASA/Goddard Space Flight Center, (3) George Mason University
{tarek, esam, agarwala}@gwu.edu, lemoigne@backserv.gsfc.nasa.gov, kgaj@gmu.edu

Objectives and Introduction
  - Investigate the use of reconfigurable computing for on-board automatic processing of remote sensing data
  - Remote sensing image classification
  - Applications: land classification, mining, geology, forestry, agriculture, environmental management, global atmospheric profiling (e.g. water vapor and temperature profiles), and planetary space missions
  - Types of carriers: spaceborne, airborne

El-Ghazawi, E229 / MAPLD2004

Types of Sensing
  - Mono-spectral imagery: 1 band (SPOT ≡ panchromatic)
  - Multi-spectral imagery: 10s of bands (MODIS ≡ 36 bands, SeaWiFS ≡ 8 bands, IKONOS ≡ 5 bands)
  - Hyperspectral imagery: 100s to 1000s of bands (AVIRIS ≡ 224 bands, AIRS ≡ 2378 bands)
  [Figure: multispectral / hyperspectral imagery comparison]

Different Airborne Hyperspectral Systems
  - AISA
  - AURORA
  - AVIRIS
  - GER

Why On-Board Processing?
  Problems:
    - Complex pre-processing steps: image registration / fusion
    - Large data volumes
    - Large cost and complexity of the on-the-ground / Earth processing systems
    - Large latency for critical decisions
    - Large data downlink bandwidth requirements
  Solutions:
    - Automatic on-board processing
        - Reduces the cost and the complexity of the on-the-ground / Earth processing system: larger utilization for a broader community, including educational institutions
        - Enables autonomous decisions to be taken on-board: faster critical decisions
        - Applications:
            » Future reconfigurable web sensors missions
            » Future Mars and planetary exploration missions
    - Dimension reduction*
        - Reduction of communication bandwidth
        - Simpler and faster subsequent computations
  * Investigated pre-processing step

Why Reconfigurable Computers?
  On-board processing problems:
    - High computational complexities
    - Low performance for traditional processing platforms
    - High form / wrap factors (size and weight) for parallel computing systems
    - Low flexibility for traditional ASIC-based solutions
    - High costs and long design cycles for traditional ASIC-based solutions
  Solutions: Reconfigurable Computers (RCs)
    - Higher performance (throughput and processing power) compared to conventional processors
    - Lower form / wrap factors compared to parallel computers
    - Higher flexibility (reconfigurability) compared to ASICs
    - Lower costs and shorter time-to-solution compared to ASICs

Introduction

Data Arrangement
  [Figure: hyper image cube (512 pixels x 512 pixels x 224 bands) and its matrix form. The Pixels ≡ (Rows x Columns) axis is annotated "Parallel Computing Scope, Reconfigurable Computing 2nd Scope"; the Bands axis is annotated "Reconfigurable Computing 1st Scope".]

Data Arrangement (cont'd)
  [Figure: hyper image (Rows x Columns x Bands, with Pixels = Rows x Columns) unrolled pixel by pixel, from pixel (0,0) through (0,1), ..., (0,cols-1), ..., (rows-1,0), ..., (rows-1,cols-1); each pixel carries bands 0, 1, 2, ..., Bands-1.]
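The pixel-by-pixel unrolling in the data-arrangement figure can be illustrated with a small indexing helper. This is only a sketch: the function names are hypothetical, and it assumes the array form stores all bands of pixel (0,0) first, then all bands of pixel (0,1), and so on in row-major order, which is what the figure labels suggest.

```python
def array_offset(row, col, band, cols, bands):
    """Offset of one band sample in the flattened (array-form) image.

    Assumes pixel-by-pixel storage: all bands of pixel (0, 0), then all
    bands of pixel (0, 1), and so on up to pixel (rows-1, cols-1).
    """
    pixel = row * cols + col  # Pixels = Rows x Columns, row-major order
    return pixel * bands + band


def spectral_signature(flat, row, col, cols, bands):
    """Return the 1-D spectral signature (all bands) of one pixel."""
    start = array_offset(row, col, 0, cols, bands)
    return flat[start:start + bands]


# Tiny hypothetical cube: 2 rows x 3 columns x 4 bands, filled with 0..23.
rows, cols, bands = 2, 3, 4
flat = list(range(rows * cols * bands))
last = spectral_signature(flat, rows - 1, cols - 1, cols, bands)
print(last)  # the final `bands` entries of the flattened array
```

With this layout each pixel's spectral signature is contiguous in memory, so a per-pixel 1-D transform along the band axis reads one unbroken run of samples.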
  Array form: each entry is 8 bits wide; the last entry is band Bands-1 of pixel (rows-1, cols-1).

Examples of Hyperspectral Datasets
  - AVIRIS: INDIAN PINES’92 (400x400 by 192 bands)
  - AVIRIS: SALINAS’98 (217x512 by 192 bands)

Dimension Reduction Techniques
  Principal Component Analysis (PCA):
    - Most common dimension reduction method
    - Does not preserve spectral signatures
    - Complex and global computations: difficult for parallel processing and hardware implementations
  Wavelet-based dimension reduction:
    - Preserves spectral signatures
    - High-performance implementation
    - Simple and local operations
  [Figure: multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preservation of spectral locality)]

2-D DWT (1-level Decomposition)
  [Figure: filter bank built from 1-D DWT stages. Low-pass (L) and high-pass (H) filters, each followed by a block labeled 2, produce the LL, HL, LH and HH sub-bands.]

2-D DWT (2-level Decomposition)
  [Figure: the same L / H filter bank applied twice, with panels labeled "First Level" and "Second Level"; sub-band labels HL, LH and HH.]

Wavelet-Based vs. PCA (Execution Time, 500 MHz P3)
  Complexity: Wavelet-Based = O(MN); PCA = O(MN^2 + N^3)
  [Figure: "Timer-Salinas98" bar chart, Time (sec) vs. No. of PC / Level of Decomp.]

  No. of PC / Level    6/5      12/4     24/3      48/2      96/1
  Wavelet (sec)        7.696    7.677    7.631     7.715     9.003
  PCA (sec)            90.634   94.824   104.178   122.173   158.583

Wavelet-Based vs. PCA (cont'd) (Execution Time, 500 MHz P3)
  Complexity: Wavelet-Based = O(MN); PCA = O(MN^2 + N^3)
  [Figure: two pie charts of the time distribution. Wavelet-Based: Comp. 92%, IO_R 5%, IO_W 3%. PCA: Comp. 100%, IO_R 0%, IO_W 0%.]

  WAVELET timers (sec), by No. of PC / Level of Wavelet Decomposition
  Timer     6/5      12/4     24/3     48/2     96/1
  GLOBAL    7.696    7.677    7.631    7.715    9.003
  IO_R      0.406    0.412    0.411    0.412    0.41
  Comp.     7.253    7.19     7.069    6.692    7.939
  IO_W      0.037    0.075    0.151    0.311    0.654

  PCA timers (sec), by No. of PC / Level of Wavelet Decomposition
  Timer     6/5      12/4     24/3      48/2      96/1
  GLOBAL    90.634   94.824   104.178   122.173   158.583
  IO_R      0.423    0.395    0.395     0.394     0.394
  Comp.     90.173   94.355   103.633   121.478   157.568
  IO_W      0.038    0.074    0.15      0.301     0.621

Wavelet-Based vs. PCA (cont'd) (Classification Accuracy)
  - Implemented on the HIVE (8 Pentium Xeon / Beowulf-type system): 6.5 times faster than the sequential implementation
  - Classification accuracy similar to or better than PCA
  - Faster than PCA

The Algorithm
  [Figure: two flowcharts, OVERALL and PIXEL LEVEL]
  OVERALL:
    - Read Data
    - Read Threshold (Th)
    - Compute Level for Each Individual Pixel (PIXEL LEVEL)
    - Remove Outlier Pixels
    - Get Lowest Level (L) from Global Histogram
    - Decompose Each Pixel to Level L
    - Write Data
  PIXEL LEVEL:
    - Decompose Spectral Pixel, giving the DWT Coefficients (the Approximation)
    - Save Current Level [a] of Wavelet Coefficients
    - Reconstruct Individual Pixel to Original Stage, giving the Reconstructed Approximation
    - Compute Correlation (Corr) between Orig and Recon.
    - Test Corr < Th, with branches labeled No and Yes
    - Get Current Level [a] of Wavelet Coefficients
    - Add Contribution of the Pixel to Global Histogram

Prototyping Wavelet-Based Dimension Reduction of Hyperspectral Imagery on a Reconfigurable Computer, the SRC-6E

Hardware Architecture of SRC-6E
  [Figure: hardware architecture of the SRC-6E]

SRC Compilation Process
  [Figure: compilation flow]
  - Application sources (.c or .f files) feed the µP Compiler and the MAP Compiler, which produce object files (.o files)
  - Macro sources (.vhd or .v files) and HDL sources (.v files) feed logic synthesis, which produces netlists (.ngo files)
  - Place & Route turns the netlists into configuration bitstreams (.bin files)
  - The Linker combines object files and bitstreams into the application executable

Top Hierarchy Module
  [Figure: block diagram. Blocks: DWT_IDWT, Correlator, Histogram, MUX. Signals: X, TH, N, L1:L5, Y1:Y5, GTE_1:GTE_5, Level, Llevel.]

Decomposition and Reconstruction Levels of Dimension Reduction (DWT_IDWT)
  [Figure: five cascaded levels (Level_1 to Level_5). Starting from X (L0), each level applies the filter L and a block labeled 2 to produce L1 to L5. Chains of blocks labeled 2 and filters L’, followed by blocks labeled D, rebuild the outputs Y1 to Y5.]

FIR Filters (L, L’) Implementation
  [Figure: the input image D(i) passes through a chain of registers; the taps are multiplied by the coefficients C(1), C(2), C(3), ..., C(n) and summed to give the output image F(i).]

Correlator Module
  \[ \mathrm{term}_{AB} = N \sum_i A_i B_i - \sum_i A_i \sum_i B_i \qquad (16 + 2\log_2 N \text{ bits}) \]
  \[ \rho^2(x, y) = \frac{\mathrm{term}_{xy}^2}{\mathrm{term}_{xx}\,\mathrm{term}_{yy}} \ge \left(\frac{TH}{2^{16}}\right)^2 \]
  [Figure: datapath. Inputs X, Yi and N feed term_AB units that produce term_xy, term_xx and term_yy. Multipliers (MULT) form term_xy^2, term_xx term_yy and TH^2; term_xy^2 is shifted left by 32 bits before the Compare block, whose output GTE_i increments Histogram_i.]

Histogram Module
  [Figure: GTE_1 to GTE_5 feed "Update Histogram Counters" (cnt_1 to cnt_5), followed by "Level Selector", which outputs Level.]

Resource Utilization and Operating Frequency
  [Figure: resource utilization and operating frequency]

Measurements Scenarios
  [Figure: timeline of the measured intervals]
  - µP functions: MAP Alloc., Read Data, Write Data, MAP Free
  - MAP function, repeated nstreams times: CM to OBM (Transfer-In), Compute (Computations), OBM to CM (Transfer-Out)
  - Intervals: Allocation time, End-to-End time (HW), Configuration + End-to-End time (SW), End-to-End time with I/O, Release time

SRC Experiment Setup and Results
  - Salinas’98: 217 x 512 pixels, 192 bands = 162.75 MB
  - Number of streams = 41
  - Stream size = 2730 voxels ≈ 4 MB
  Non-overlapped streams:
    [Figure: timeline DMA-IN, Compute, DMA-OUT repeated per stream; T_TOTAL spans T_DMA-IN, T_COMPUTATIONS and T_DMA-OUT]
    - T_DMA-IN = 13.040 msec
    - T_COMP = 0.62428 msec
    - T_DMA-OUT = 22.712 msec
    - T_Total = 1.49 sec
    - Throughput = 109.23 MB/sec
  Overlapped streams:
    [Figure: timeline with Compute running alongside DMA-IN and DMA-OUT]
    - T_DMA = 35.752 msec
    - T_COMP = 0.62428 msec
    - X_c = 0.0175
    - Throughput = 111.14 MB/sec
    - Speedup over non-overlapped = (1 + X_c) = 1.0175 (insignificant)

Execution Time
  [Figure: Salinas'98 bar chart, Time (sec) vs. Level of Decomposition (1 to 5). Series: P3 (500MHz), Intel Xeon (1.8GHz), SRC-6E (Non-Overlapped), SRC-6E (Overlapped). The SRC-6E bars read 1.49 sec (non-overlapped) and 1.47 sec (overlapped) at every level. P3 and Xeon data labels, in extraction order: 33.05, 30.22, 20.44, 20.23, 23.21, 16.16, 14.27, 20.21, 12.34, 8.60.]

Distribution of Execution Times
  [Figure: distribution of execution times]

Speedup Results
  [Figure: Salinas'98 chart, Speedup vs. Level of Decomposition. Series: No Overlapping Speedup (P3, 500MHz), No Overlapping Speedup (Xeon, 1.8GHz), Overlapping Speedup (P3, 500MHz), Overlapping Speedup (Xeon, 1.8GHz).]
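The automatic level-selection flow from "The Algorithm" (per-pixel decomposition, reconstruction, correlation against a threshold, global histogram) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not name the wavelet, so a Haar-style averaging low-pass filter is assumed; the outlier rule (drop a fixed fraction of the pixels that need the shallowest levels) is likewise an assumption; and all function names and parameters are hypothetical. The signature length is assumed divisible by 2**max_level.

```python
import math


def lowpass_downsample(signal):
    """One assumed Haar low-pass stage: average sample pairs, keep one per pair."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def reconstruct(approx, level):
    """Expand an approximation back to the original length (details set to zero)."""
    out = []
    for value in approx:
        out.extend([value] * (2 ** level))
    return out


def correlation(x, y):
    """Pearson correlation built from the term_AB = N*sum(AB) - sum(A)*sum(B) form."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    term_xy = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    term_xx = n * sum(a * a for a in x) - sx * sx
    term_yy = n * sum(b * b for b in y) - sy * sy
    if term_xx <= 0 or term_yy <= 0:
        return 0.0  # a constant signal carries no correlation information
    return term_xy / math.sqrt(term_xx * term_yy)


def pixel_level(signature, threshold, max_level=5):
    """Deepest level whose reconstruction still correlates >= threshold."""
    level, approx = 0, list(signature)
    for candidate in range(1, max_level + 1):
        approx = lowpass_downsample(approx)
        if correlation(signature, reconstruct(approx, candidate)) < threshold:
            break
        level = candidate
    return level


def global_level(pixels, threshold, max_level=5, outlier_fraction=0.1):
    """Build the global histogram, skip assumed outliers, return the lowest level."""
    histogram = [0] * (max_level + 1)
    for signature in pixels:
        histogram[pixel_level(signature, threshold, max_level)] += 1
    to_drop = int(outlier_fraction * len(pixels))
    for level, count in enumerate(histogram):
        if count > to_drop:
            return level, histogram
        to_drop -= count
    return max_level, histogram


def reduce_pixel(signature, level):
    """Decompose one pixel to the chosen level; the approximation is the output."""
    approx = list(signature)
    for _ in range(level):
        approx = lowpass_downsample(approx)
    return approx


# Hypothetical data: nine smooth 32-band pixels and one rapidly alternating pixel.
smooth = [i // 8 for i in range(32)]
noisy = [0, 1] * 16
level, histogram = global_level([smooth] * 9 + [noisy], threshold=0.95)
print(level, histogram, reduce_pixel(smooth, level))
```

The software loop stops at the first level that fails the threshold, whereas the hardware design described in the slides evaluates five levels side by side (GTE_1 to GTE_5) and lets the histogram module select the level.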
Concluding Remarks
  - We prototyped the automatic wavelet-based dimension reduction algorithm on a reconfigurable architecture
  - Both coarse-grain and fine-grain parallelism are exploited
  - We observed a 10x speedup using the P3 version of the SRC-6E. From our previous experience, we expect this speedup to double using the P4 version of the SRC machine
  - These speedup figures were obtained while I/O is still dominating. The speedup can be increased by improving the I/O bandwidth of the reconfigurable platforms
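The remark that I/O still dominates can be cross-checked from the timings reported in the experiment setup. The sketch below recomputes the totals under a simple model that is an assumption here: non-overlapped streams cost T_DMA-IN + T_COMP + T_DMA-OUT each, while overlapped streams cost only the DMA time. Two further assumptions are labeled in the code: 8 bytes per band sample (inferred only because it reproduces the stated 162.75 MB) and a "voxel" taken to mean one pixel with all of its bands.

```python
import math

MIB = 2 ** 20

rows, cols, bands = 217, 512, 192   # Salinas'98, as reported
bytes_per_sample = 8                # assumption: reproduces the stated 162.75 MB
pixels_per_stream = 2730            # assumption: one "voxel" = one pixel, all bands

t_dma_in = 13.040e-3                # seconds per stream, as reported
t_comp = 0.62428e-3
t_dma_out = 22.712e-3

data_mb = rows * cols * bands * bytes_per_sample / MIB
n_streams = math.ceil(rows * cols / pixels_per_stream)
stream_mb = pixels_per_stream * bands * bytes_per_sample / MIB

# Assumed model: non-overlapped streams serialize DMA-in, compute and DMA-out.
t_non_overlapped = n_streams * (t_dma_in + t_comp + t_dma_out)

# Assumed model: with overlapping, compute hides behind the DMA transfers.
t_dma = t_dma_in + t_dma_out
t_overlapped = n_streams * t_dma

x_c = t_comp / t_dma                        # compute time relative to DMA time
speedup = t_non_overlapped / t_overlapped   # equals 1 + x_c under this model

print(f"data size        = {data_mb:.2f} MB in {n_streams} streams of {stream_mb:.2f} MB")
print(f"non-overlapped   = {t_non_overlapped:.2f} s, {data_mb / t_non_overlapped:.2f} MB/s")
print(f"overlapped       = {t_overlapped:.2f} s, {data_mb / t_overlapped:.2f} MB/s")
print(f"X_c = {x_c:.4f}, speedup from overlapping = {speedup:.4f}")
```

Under this model, computation is under 2% of the per-stream DMA time, so hiding it behind the transfers gains almost nothing. The throughputs computed from the unrounded totals differ slightly from the reported 109.23 and 111.14 MB/sec, which match the rounded T_Total of 1.49 sec.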