Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science, Columbia University

Motivation: Stream Programming
• Stream programming model
  – filter: a piece of sequential code
  – channel: how filters communicate
  – token: an indivisible unit of data for a filter
• Examples: signal processing, image processing, embedded applications
(Figure: filters A → B → C connected by channels)

Example: Constant False-Alarm Rate (CFAR) Detection
• From the HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
• Compare the cell under test to the cells in a surrounding window, skipping the guard cells next to it
• Additional factors: number of gates, threshold (μ), number of guard cells, rows and other dimensional data

CFAR Benchmark
• Filter pipeline: uInt to float → square → right window / left window → align data → add → find targets
(Animation: a data stream of cells 1–15 flows through the pipeline; the left and right windows slide along the stream on either side of the cell under test)

Mapping
• Filters A, B, C are mapped onto the cores of a multi-core chip, either one filter per core or with several filters sharing a core
• Throughput: rate of processing data tokens

Unbalanced Flow
• Bottlenecks reduce throughput
  – caused by backpressure
    • inherent algorithmic imbalances
    • data-dependent computational spikes
(Figure: filter B on core 2 is a bottleneck, forcing core 1 to wait)

Data-Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Measured over blocks of about 100 cells
• Extra workload of 32 microseconds per target
(Figure: execution-time histograms for runs with 0.3%, 1.3%, and 7.3% targets; execution times range from roughly 2 to 580 microseconds)
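The CFAR comparison described above can be sketched as a cell-averaging detector. This is a minimal illustrative version in Python, not the HPEC kernel itself; the names `num_gates`, `num_guard`, and `mu` stand in for the benchmark's gate count, guard-cell count, and threshold μ:

```python
def cfar_targets(cells, num_gates=4, num_guard=2, mu=2.0):
    """Cell-averaging CFAR sketch: flag cell i as a target when its
    power exceeds mu times the average of the gate cells on either
    side, skipping the guard cells adjacent to the cell under test."""
    targets = []
    n = len(cells)
    for i in range(n):
        window = []
        # gate cells to the left and right, beyond the guard cells
        for g in range(1, num_gates + 1):
            left = i - num_guard - g
            right = i + num_guard + g
            if left >= 0:
                window.append(cells[left])
            if right < n:
                window.append(cells[right])
        if window and cells[i] > mu * (sum(window) / len(window)):
            targets.append(i)
    return targets
```

On a flat stream with one strong cell, only that cell is flagged; the guard cells keep a target from inflating its own noise estimate.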
Data-Dependent Execution Time
• Some other examples:
  – Bloom filters (finance, spam detection)
  – compression (image processing)
(Figure: execution-time histogram, 0.000–0.005 s)

Our Solution: Flexible Filters
• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck
• Semantic preservation
(Figure: a flex-split on core 2 and a flex-merge on core 3 wrap an auxiliary copy of filter C placed on core 2)

Related Work
• Static compiler optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic runtime load balancing
  – Work stealing / filter migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue-based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue-based load balancing, filter reordering
• Combined static + dynamic
  – FlexStream [Hormati et al., 2009]: multiple competing programs

Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR case study

Design Flow of a Stream Program with Flexible Filters
• Flow: design the stream algorithm → mapping (filters, memory) → profile → add flexibility to the bottlenecks
• Mapping and profiling are supported by compiler/design tools
(Figure: profile of the CFAR filters on Cell — uIntToFloat, right window, add, find targets — average execution time per 100 tokens, 0–5 microseconds; the same flow diagram is revisited on the Design Considerations and Adding Flexibility slides)

Outline (next: Performance)
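The throughput that flexible filters recover can be quantified with a toy model of a linear pipeline (an illustrative sketch, not the paper's tooling; the function name is made up): a stage can fire on a token only after it has finished its previous token and the upstream stage has delivered the next one, so the slowest stage pins the steady-state rate.

```python
def pipeline_finish_time(costs, num_tokens):
    """Time for a linear pipeline (one stage per core, per-token cost
    costs[i] for stage i) to emit num_tokens tokens."""
    finish = [0] * len(costs)   # finish time of each stage's latest token
    for _ in range(num_tokens):
        ready = 0               # each new token reaches stage 0 immediately
        for i, c in enumerate(costs):
            start = max(ready, finish[i])   # wait for input AND for the stage to free up
            finish[i] = start + c
            ready = finish[i]   # the token reaches stage i+1 here
    return finish[-1]
```

With per-token costs (2, 2, 3), the first token finishes at time 7 and each later token 3 time units after its predecessor, so 100 tokens finish at 7 + 99·3 = 304: roughly one token per 3 time steps, the plain-pipeline rate that a flexible filter improves on. The model ignores buffer limits; even with unbounded queues the output rate is capped by the slowest stage.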
Mapping Stream Programs to Multi-Core Platforms
• Pipeline mapping: one filter per core (A on core 1, B on core 2, C on core 3)
• Filters can also share a core
• SPMD mapping: every core runs the whole pipeline A → B → C
• Throughput: rate of processing data tokens

Throughput: SPMD Mapping
Suppose the per-token execution times are EA = 2, EB = 2, EC = 3. Each core runs A, B, C back to back on its own token:

          t0–t1  t2–t3  t4–t6
  core 1:  A1     B1     C1
  core 2:  A2     B2     C2
  core 3:  A3     B3     C3

3 tokens are processed every 7 time steps, so the ideal throughput is 3/7 ≈ 0.429.

Throughput: Pipeline Mapping
Core 1 runs A (latency 2), core 2 runs B (latency 2), core 3 runs C (latency 3); in steady state the slowest stage sets the pace:
  throughput = 1/3 ≈ 0.333 < 0.429

Data Blocks
• data block: a group of data tokens, the unit moved between cores

Throughput: Pipeline Augmented with Flexibility
With an auxiliary copy of C on core 2 absorbing part of C's load during B's idle cycles, two tokens complete every five time steps:
  throughput = 2/5 = 0.4 < 0.429 (but > 0.333)

Outline (next: Implementation of Flexible Filters)

Flex-Split
Pseudocode:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| − n0
  send n0 tokens to out0 and n1 tokens to out1
  send n0 0's, then n1 1's, to select
• Maintains ordering based on the run-time state of the queues
(Figure: the flex-split on core 2 routes tokens either to the auxiliary copy of C on core 2 or to the original C on core 3)

Flex-Merge
Pseudocode:
  pop i from select
  if i is 0, pop a token from in0
  if i is 1, pop a token from in1
  push the token to out

Overhead of Flexibility?
(Figure: the flex-split, select channel, and flex-merge added around the flexible copy of C between cores 2 and 3)
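The flex-split and flex-merge pseudocode above can be turned into a runnable sketch. Python lists stand in for the hardware channels, and `out0_space` models the available space on out0; this mirrors the slide's logic but is not the Gedae implementation:

```python
from collections import deque

def flex_split(block, out0_space):
    """Split a data block between the regular path (out0) and the
    flexible path (out1), recording the routing decisions on the
    select channel so flex-merge can restore the original order."""
    n0 = min(out0_space, len(block))  # tokens that fit on out0
    out0 = block[:n0]
    out1 = block[n0:]                 # n1 = |b| - n0 overflow tokens
    select = [0] * n0 + [1] * len(out1)
    return out0, out1, select

def flex_merge(in0_tokens, in1_tokens, select):
    """Replay the select channel: a 0 pops from in0, a 1 pops from in1."""
    in0, in1 = deque(in0_tokens), deque(in1_tokens)
    return [in0.popleft() if i == 0 else in1.popleft() for i in select]
```

For example, splitting the block [1, 2, 3, 4, 5] with room for 3 tokens on out0 routes [1, 2, 3] down the regular path, [4, 5] down the flexible path, and emits select = [0, 0, 0, 1, 1]; merging with that select stream reproduces the original order, which is the semantic-preservation property the slides claim.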
Multi-Channel Flex-Split and Flex-Merge
• A flexible filter with n output channels needs one flex-merge per output channel, all driven by the same select stream
• A flexible filter with n input channels can be handled in two ways:
  – Centralized: a single flex-split feeds the original filter and its flexible copy, with one select channel and one flex-merge
  – Distributed: one flex-split per input channel, with the splits kept consistent with each other (the β channels in the figure)

Outline (next: Experiments — CFAR Case Study)

Cell BE Processor
• Distributed memory
• Heterogeneous: 8 SIMD (SPU) cores and 1 PowerPC (PPU) core
• Element Interconnect Bus: 4 rings, 205 GB/s
• Communication layer: the Gedae programming language

Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools

CFAR Benchmark
• Filter pipeline: uInt to float → square → right window / left window → align data → add → find targets
(Figure: profile of the CFAR filters on Cell — uIntToFloat, right window, add, find targets — average execution time per 100 tokens, 0–5 microseconds)

Data Dependency
• Changing the threshold changes the percentage of targets: 1.3% or 7.3%
• Additional workload per target: 16 μs, 32 μs, or 64 μs
Speedup with flexible filters:

  Additional Workload | 1.3% targets | 7.3% targets
  16 μs               | 0.82         | 1.45
  32 μs               | 1.06         | 1.39
  64 μs               | 1.27         | 1.47

More Benchmarks

  Benchmark     | Field              | Parameters                        | Speedup
  Dedup         | Information Theory | Rabin block / max chunk: 4096/512 | 2.00
  JPEG          | Image Processing   | image 128x128                     | 1.31
                |                    | image 256x256                     | 1.16
                |                    | image 512x512                     | 1.25
  Value-at-Risk | Finance            | stocks/walks/timesteps:           |
                |                    | 16/1024/1024                      | 0.98
                |                    | 64/1024/1024                      | 1.56
                |                    | 128/1024/1024                     | 1.55

Conclusions
• Flexible filters
  – adapt to data-dependent bottlenecks
  – provide distributed load balancing
  – provide speedup without modification of the original filters
  – can be implemented on top of general stream languages