A KASSPER Real-Time Embedded Signal Processor Testbed

Glenn Schrader
Massachusetts Institute of Technology Lincoln Laboratory
28 September 2004

This work is sponsored by the Defense Advanced Research Projects Agency, under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline
• KASSPER Program Overview
• KASSPER Processor Testbed
• Computational Kernel Benchmarks
  – STAP Weight Application
  – Knowledge Database Pre-processing
• Summary

Knowledge-Aided Sensor Signal Processing and Expert Reasoning (KASSPER)
MIT LL program objective: develop a systems-oriented approach to revolutionize ISR signal processing for operation in difficult environments.
• Develop and evaluate high-performance integrated radar system/algorithm approaches
• Incorporate knowledge into radar resource scheduling, signal, and data processing functions
• Define, develop, and demonstrate an appropriate real-time signal processing architecture
[Figure: the four radar operating modes: SAR Spot, Synthetic Aperture Radar (SAR) Stripmap, Ground Moving Target Indication (GMTI) Search, and GMTI Track]

Representative KASSPER Problem
• Radar system challenges
  – Heterogeneous clutter environments
  – Clutter discretes
  – Dense target environments
  – These result in decreased system performance (e.g., higher Minimum Detectable Velocity, degraded Probability of Detection vs. False Alarm Rate)
• KASSPER system characteristics
  – Knowledge processing
  – Knowledge-enabled algorithms
  – Knowledge repository
  – Multiple waveforms + algorithms
  – Intelligent scheduling
• Knowledge types
  – Mission goals
  – Static knowledge
  – Dynamic knowledge
[Figure: an ISR platform carrying a multi-function radar with the KASSPER system observes Convoy A, Convoy B, and a Launcher]

Outline
• KASSPER Program Overview
• KASSPER Processor Testbed
• Computational Kernel Benchmarks
  – STAP Weight Application
  – Knowledge Database Pre-processing
• Summary

High-Level KASSPER Architecture
[Figure: block diagram. The Inertial Navigation System (INS), Global Positioning System (GPS), etc. feed a Look-ahead Scheduler. Sensor Data flows into Signal Processing, which passes Target Detects to a Data Processor (tracker, etc.) that produces Track Data. A Knowledge Database (e.g., mission knowledge, terrain knowledge) is populated by a static (i.e., "off-line") knowledge source, an external system (e.g., real-time intelligence), and an operator.]

Baseline Signal Processing Chain
[Figure: the GMTI functional pipeline. Raw Sensor Data and INS/GPS data enter a Data Input Task, followed by Beamformer, Filter, and Detect Tasks; a Data Output Task delivers Target Detects to the Data Processor (tracker, etc.). Knowledge Processing and the Look-ahead Scheduler interact with the pipeline through the Knowledge Database, and Track Data feeds back from the Data Processor.]
Baseline data cube sizes:
• 3 channels x 131 pulses x 1676 range cells
• 12 channels x 35 pulses x 1666 range cells
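For a sense of scale, the sketch below (an illustration added here, not part of the original briefing) builds the two baseline data cubes in NumPy. Complex single-precision samples are assumed, matching the benchmark data format described later in the deck; the variable names are illustrative.

```python
import numpy as np

# Baseline GMTI data cube sizes from the slide above
# (channels x pulses x range cells). Complex single precision
# is an assumption carried over from the benchmark description.
cube_a = np.zeros((3, 131, 1676), dtype=np.complex64)
cube_b = np.zeros((12, 35, 1666), dtype=np.complex64)

for name, cube in (("3 x 131 x 1676", cube_a), ("12 x 35 x 1666", cube_b)):
    print(f"{name}: {cube.size} samples, {cube.nbytes / 2**20:.1f} MiB")
```

Each cube is only a few megabytes, so the concern in the rest of the deck is processing efficiency rather than raw storage.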
KASSPER Real-Time Processor Testbed
[Figure: processing architecture. Sensor data storage and "knowledge" storage act as a surrogate for the actual radar system, supplying sensor data and INS/GPS data to the KASSPER Signal Processor, which hosts the Look-ahead Scheduler, Signal Processing, the Data Processor (tracker, etc.), the Knowledge Database, and an off-line knowledge source (DTED, VMAP, etc.).]
• Software architecture: Applications are built on the Parallel Vector Library (PVL), which is in turn built on computational kernels
  – Rapid prototyping
  – High-level abstractions
  – Scalability
  – Productivity
  – Rapid algorithm insertion
• Hardware: 120 PowerPC G4 CPUs @ 500 MHz; 4 GFlop/sec per CPU = 480 GFlop/sec peak

KASSPER Knowledge-Aided Processing Benefits
[Figure: knowledge-aided processing benefits; the graphic is not recoverable from the extracted text.]

Outline
• KASSPER Program Overview
• KASSPER Processor Testbed
• Computational Kernel Benchmarks
  – STAP Weight Application
  – Knowledge Database Pre-processing
• Summary

Space Time Adaptive Processing (STAP)
[Figure: training samples 1 through N are drawn from range cells near the range cell of interest to form the training data A; together with a steering vector v, the weight solve ( R = AᴴA, W = R⁻¹v ) produces the STAP weights W, which multiply the range cell of interest to yield a range cell with suppressed clutter.]
• Training samples are chosen to form an estimate of the background
• Multiplying the STAP weights by the data at a range cell suppresses features that are similar to the features in the training data

Space Time Adaptive Processing (STAP): Broken Assumptions
[Figure: same diagram as above]
• Simple training selection approaches may lead to degraded performance, since the clutter in the training set may not match the clutter in the range cell of interest

Space Time Adaptive Processing (STAP)
[Figure: same diagram as above, with training samples now selected using terrain knowledge]
• Use of terrain knowledge to select training samples can mitigate the degradation
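The weight solve shown on the STAP slides can be written compactly. The sketch below is a minimal NumPy illustration of R = AᴴA, W = R⁻¹v, not the testbed's PVL implementation; the toy dimensions, variable names, and random training data are assumptions for the example.

```python
import numpy as np

def stap_weights(A, v):
    """Solve for STAP weights from training data A (N training samples
    x M degrees of freedom) and steering vector v (length M), following
    the R = A^H A, W = R^-1 v formulation on the slides."""
    R = A.conj().T @ A            # sample covariance estimate
    return np.linalg.solve(R, v)  # W = R^-1 v without forming R^-1

# Toy example: 24 training samples, 12 degrees of freedom.
rng = np.random.default_rng(0)
A = rng.standard_normal((24, 12)) + 1j * rng.standard_normal((24, 12))
v = np.ones(12, dtype=complex)
W = stap_weights(A, v)

# Applying the weights to a range cell suppresses features that
# resemble the training data: y = W^H x.
x = rng.standard_normal(12) + 1j * rng.standard_normal(12)
y = W.conj() @ x
```

Note that np.linalg.solve factors R rather than explicitly inverting it, which is both cheaper and numerically safer than forming R⁻¹.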
STAP Weights and Knowledge
• STAP weights are computed from training samples within a training region
• To improve processor efficiency, many designs apply one set of weights across a swath of range (i.e., as a matrix multiply)
• With knowledge-enhanced STAP techniques, the processor MUST assume that adjacent range cells will not use the same weights (i.e., weight selection/computation is data dependent)
[Figure: two weight-application diagrams. Left: training produces a single weight matrix that multiplies the entire STAP input; this is equivalent to a matrix multiply, since the same weights are used for all columns. Right: training plus external data drive a "select 1 of N" choice of weights for each column; a matrix multiply cannot be used, since the weights applied to each column are data dependent.]

Benchmark Algorithm Overview
• Benchmark reference algorithm (single weight multiply): STAP Output = Weights x STAP Input
• Benchmark data-dependent algorithm (multiple weight multiply): N pre-computed weight matrices; the weights applied to each column of the STAP input are selected sequentially, modulo N
• The benchmark's goal is to determine the achievable performance compared to the well-known matrix multiply algorithm

Test Case Overview
• J x K x L data sets: the single weight multiply applies one J x K weight matrix to a K x L input; the multiple weight multiply selects one of N = 10 J x K weight matrices for each column
• Complex data: single precision, interleaved
• Two test architectures
  – PowerPC G4 (500 MHz)
  – Pentium 4 (2.8 GHz)
• Four test cases
  – Single weight multiply without cache flush
  – Single weight multiply with cache flush
  – Multiple weight multiply without cache flush
  – Multiple weight multiply with cache flush

Data Sets #1 through #5: L x L x L, 32 x 32 x L, 20 x 20 x L, 12 x 12 x L, and 8 x 8 x L
[Plots: for each data set, MFLOP/sec vs. length L for the single and multiple weight multiplies, each with and without cache flushes; the numeric results are not recoverable from the extracted text.]

Performance Observations
• Data-dependent processing cannot be optimized as well
  – Branching prevents compilers from "unrolling" loops to improve pipelining
• Data-dependent processing does not allow efficient "reuse" of already-loaded data
  – Algorithms cannot make simplifying assumptions about already having the data that is needed
• Data-dependent processing does not vectorize
  – Using either AltiVec or SSE is very difficult, since data movement patterns become data dependent
• Higher memory bandwidth reduces the cache impact
• Actual performance scales roughly with clock rate rather than with theoretical peak performance
Approximately 3x to 10x performance degradation; processing efficiency of 2% to 5%, depending on CPU architecture.
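To make the two benchmark kernels concrete, here is a minimal NumPy sketch of both. It is an illustration, not the benchmarked code (which ran on the G4 and P4 and was timed with and without cache flushes); the function names and toy dimensions are assumptions, while the sequential modulo-N weight selection follows the benchmark description above.

```python
import numpy as np

def single_weight_multiply(W, X):
    """Reference kernel: one J x K weight matrix applied to every
    column of the K x L input, i.e. an ordinary matrix multiply."""
    return W @ X

def multiple_weight_multiply(Ws, X):
    """Data-dependent kernel: one of N precomputed J x K weight
    matrices is selected for each input column. The benchmark selects
    sequentially modulo N; a real knowledge-aided system would select
    based on the data itself."""
    N = len(Ws)
    L = X.shape[1]
    return np.stack([Ws[l % N] @ X[:, l] for l in range(L)], axis=1)

# Toy case: J = K = 12, L = 1024, N = 10 weight matrices,
# complex single precision as in the benchmark.
rng = np.random.default_rng(1)
J, K, L, N = 12, 12, 1024, 10
X = (rng.standard_normal((K, L))
     + 1j * rng.standard_normal((K, L))).astype(np.complex64)
Ws = [(rng.standard_normal((J, K))
       + 1j * rng.standard_normal((J, K))).astype(np.complex64)
      for _ in range(N)]
Y = multiple_weight_multiply(Ws, X)  # J x L output
```

The per-column loop in multiple_weight_multiply is exactly what defeats the optimizations listed above: the operation can no longer be treated as a single matrix multiply, a lone weight matrix cannot be held in cache and reused across all columns, and vectorization across columns becomes data dependent.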
Outline
• KASSPER Program Overview
• KASSPER Processor Testbed
• Computational Kernel Benchmarks
  – STAP Weight Application
  – Knowledge Database Pre-processing
• Summary

Knowledge Database Architecture
[Figure: block diagram. A Knowledge Manager handles lookups and updates driven by GPS, INS, user inputs, and the Look-ahead Scheduler, and performs off-line pre-processing of new knowledge (time-critical targets, etc.). A Knowledge Cache loads, stores, and sends stored data between Signal Processing (which takes sensor data in and sends results to downstream processing such as the tracker) and the Knowledge Store. An Off-line Knowledge Reformatter converts raw a priori raster and vector knowledge (DTED, VMAP, etc.) into the Knowledge Store.]

Static Knowledge
• Vector data
  – Each feature is represented by a set of point locations:
    Points: a longitude and latitude (e.g., towers)
    Lines: a list of points; the first and last points are not connected (e.g., roads, rail)
    Areas: a list of points; the first and last points are connected (e.g., forest, urban)
  – Standard vector format: Vector Smart Map (VMAP)
• Matrix (raster) data
  – Rectangular arrays of evenly spaced data points
  – Standard raster format: Digital Terrain Elevation Data (DTED)
[Figure: example map overlays showing trees (area features), roads (line features), and DTED terrain.]

Terrain Interpolation
[Figure: geometry relating the radar slant range, the sampled terrain height and type data, the Earth radius, and the geographic position of a range cell.]
• Converting from radar to geographic coordinates requires iterative refinement of the position estimate (a sketch of one such iteration appears after the Summary)

Terrain Interpolation Performance Results

CPU Type   | Time per interpolation | Processing rate (efficiency)
P4 2.8 GHz | 1.27 µs                | 144 MFlop/sec (1.2%)
G4 500 MHz | 3.97 µs                | 46 MFlop/sec (2.3%)

• Data-dependent processing does not vectorize
  – Using either AltiVec or SSE is very difficult
• Data-dependent processing cannot be optimized as well
  – Compilers cannot "unroll" loops to improve pipelining
• Data-dependent processing does not allow efficient "reuse" of already-loaded data
  – Algorithms cannot make simplifying assumptions about already having the data that is needed
Processing efficiency of 1% to 3%, depending on CPU architecture.

Outline
• KASSPER Program Overview
• KASSPER Processor Testbed
• Computational Kernel Benchmarks
  – STAP Weight Application
  – Knowledge Database Pre-processing
• Summary

Summary
• Observations
  – The use of knowledge processing provides a significant system performance improvement
  – Compared to traditional signal processing algorithms, the implementation of knowledge-enabled algorithms can result in a factor of 4 or 5 lower CPU efficiency (as low as 1% to 5%)
  – Next-generation systems that employ cognitive algorithms are likely to have similar performance issues
• Important research directions
  – Heterogeneous processors (i.e., a mix of COTS CPUs, FPGAs, GPUs, etc.) can improve efficiency by better matching hardware to the individual problems being solved. For what class of systems is the current technology adequate?
  – Are emerging architectures for cognitive information processing needed (e.g., Polymorphous Computing Architecture (PCA), Application-Specific Instruction-set Processors (ASIPs))?
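Returning to the terrain interpolation step: the conversion from radar to geographic coordinates is iterative because the terrain height at the (initially unknown) ground point feeds back into the geometry. The sketch below illustrates one such fixed-point iteration under a deliberately simplified flat-earth model; the testbed geometry involves the Earth radius, and the function names, the geometry, and the toy terrain lookup here are all assumptions for the illustration.

```python
import math

def ground_range(slant_range, platform_alt, terrain_height_at, iters=5):
    """Iteratively refine the ground range corresponding to a radar
    slant-range measurement. Each pass interpolates the terrain height
    at the current position estimate and recomputes the geometry.
    Flat-earth geometry is used here for brevity."""
    g = slant_range  # initial guess: terrain at sea level
    for _ in range(iters):
        h = terrain_height_at(g)   # interpolate sampled terrain data
        dz = platform_alt - h      # platform height above the terrain
        g = math.sqrt(max(slant_range**2 - dz**2, 0.0))
    return g

# Toy terrain lookup: a gentle 1% grade. A real system would
# interpolate DTED posts around the estimated position.
est = ground_range(30_000.0, 8_000.0, lambda g: 0.01 * g)
```

The data-dependent terrain lookup inside the loop is the same pattern flagged in the performance results above: it resists vectorization, loop unrolling, and data reuse, consistent with the 1% to 3% efficiencies measured.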