Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science, Columbia University

Motivation: Stream Programming
• Stream programming model
  – filter: a piece of sequential code
  – channel: how filters communicate
  – token: an indivisible unit of data for a filter
• Examples: signal processing, image processing, embedded applications
(Figure: filters A → B → C connected by channels)

Example: Constant False-Alarm Rate (CFAR) Detection
• From the HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
• Compare the cell under test to the cells in a surrounding window, skipping the guard cells next to it
• Additional factors: number of gates, threshold (μ), number of guard cells, rows and other dimensional data

CFAR Benchmark
• Filter pipeline: uInt to float → square → right window / left window → align data → add → find targets
(Animation: a data stream of cells 1–15 flows through the pipeline; the left and right windows slide along the stream on either side of the cell under test)

Mapping
• Filters A, B, C are mapped onto the cores of a multi-core chip, either one filter per core or with several filters sharing a core
• Throughput: rate of processing data tokens

Unbalanced Flow
• Bottlenecks reduce throughput
  – caused by backpressure
    • inherent algorithmic imbalances
    • data-dependent computational spikes
(Figure: filter B on core 2 is a bottleneck, forcing core 1 to wait)

Data-Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Measured over blocks of about 100 cells
• Extra workload of 32 microseconds per target
(Figure: execution-time histograms for runs with 0.3%, 1.3%, and 7.3% targets; execution times range from roughly 2 to 580 microseconds)
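The CFAR comparison described above can be sketched as a cell-averaging detector. This is a minimal illustrative version in Python, not the HPEC kernel itself; the names `num_gates`, `num_guard`, and `mu` stand in for the benchmark's gate count, guard-cell count, and threshold μ:

```python
def cfar_targets(cells, num_gates=4, num_guard=2, mu=2.0):
    """Cell-averaging CFAR sketch: flag cell i as a target when its
    power exceeds mu times the average of the gate cells on either
    side, skipping the guard cells adjacent to the cell under test."""
    targets = []
    n = len(cells)
    for i in range(n):
        window = []
        # gate cells to the left and right, beyond the guard cells
        for g in range(1, num_gates + 1):
            left = i - num_guard - g
            right = i + num_guard + g
            if left >= 0:
                window.append(cells[left])
            if right < n:
                window.append(cells[right])
        if window and cells[i] > mu * (sum(window) / len(window)):
            targets.append(i)
    return targets
```

On a flat stream with one strong cell, only that cell is flagged; the guard cells keep a target from inflating its own noise estimate.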
Data-Dependent Execution Time
• Some other examples:
  – Bloom filters (finance, spam detection)
  – compression (image processing)
(Figure: execution-time histogram, 0.000–0.005 s)

Our Solution: Flexible Filters
• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck
• Semantic preservation
(Figure: a flex-split on core 2 and a flex-merge on core 3 wrap an auxiliary copy of filter C placed on core 2)

Related Work
• Static compiler optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic runtime load balancing
  – Work stealing / filter migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue-based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue-based load balancing, filter reordering
• Combined static + dynamic
  – FlexStream [Hormati et al., 2009]: multiple competing programs

Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR case study

Design Flow of a Stream Program with Flexible Filters
• Flow: design the stream algorithm → mapping (filters, memory) → profile → add flexibility to the bottlenecks
• Mapping and profiling are supported by compiler/design tools
(Figure: profile of the CFAR filters on Cell — uIntToFloat, right window, add, find targets — average execution time per 100 tokens, 0–5 microseconds; the same flow diagram is revisited on the Design Considerations and Adding Flexibility slides)

Outline (next: Performance)
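The throughput that flexible filters recover can be quantified with a toy model of a linear pipeline (an illustrative sketch, not the paper's tooling; the function name is made up): a stage can fire on a token only after it has finished its previous token and the upstream stage has delivered the next one, so the slowest stage pins the steady-state rate.

```python
def pipeline_finish_time(costs, num_tokens):
    """Time for a linear pipeline (one stage per core, per-token cost
    costs[i] for stage i) to emit num_tokens tokens."""
    finish = [0] * len(costs)   # finish time of each stage's latest token
    for _ in range(num_tokens):
        ready = 0               # each new token reaches stage 0 immediately
        for i, c in enumerate(costs):
            start = max(ready, finish[i])   # wait for input AND for the stage to free up
            finish[i] = start + c
            ready = finish[i]   # the token reaches stage i+1 here
    return finish[-1]
```

With per-token costs (2, 2, 3), the first token finishes at time 7 and each later token 3 time units after its predecessor, so 100 tokens finish at 7 + 99·3 = 304: roughly one token per 3 time steps, the plain-pipeline rate that a flexible filter improves on. The model ignores buffer limits; even with unbounded queues the output rate is capped by the slowest stage.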
Mapping Stream Programs to Multi-Core Platforms
• Pipeline mapping: one filter per core (A on core 1, B on core 2, C on core 3)
• Filters can also share a core
• SPMD mapping: every core runs the whole pipeline A → B → C
• Throughput: rate of processing data tokens

Throughput: SPMD Mapping
Suppose the per-token execution times are EA = 2, EB = 2, EC = 3. Each core runs A, B, C back to back on its own token:

          t0–t1  t2–t3  t4–t6
  core 1:  A1     B1     C1
  core 2:  A2     B2     C2
  core 3:  A3     B3     C3

3 tokens are processed every 7 time steps, so the ideal throughput is 3/7 ≈ 0.429.

Throughput: Pipeline Mapping
Core 1 runs A (latency 2), core 2 runs B (latency 2), core 3 runs C (latency 3); in steady state the slowest stage sets the pace:
  throughput = 1/3 ≈ 0.333 < 0.429

Data Blocks
• data block: a group of data tokens, the unit moved between cores

Throughput: Pipeline Augmented with Flexibility
With an auxiliary copy of C on core 2 absorbing part of C's load during B's idle cycles, two tokens complete every five time steps:
  throughput = 2/5 = 0.4 < 0.429 (but > 0.333)

Outline (next: Implementation of Flexible Filters)

Flex-Split
Pseudocode:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| − n0
  send n0 tokens to out0 and n1 tokens to out1
  send n0 0's, then n1 1's, to select
• Maintains ordering based on the run-time state of the queues
(Figure: the flex-split on core 2 routes tokens either to the auxiliary copy of C on core 2 or to the original C on core 3)

Flex-Merge
Pseudocode:
  pop i from select
  if i is 0, pop a token from in0
  if i is 1, pop a token from in1
  push the token to out

Overhead of Flexibility?
(Figure: the flex-split, select channel, and flex-merge added around the flexible copy of C between cores 2 and 3)
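The flex-split and flex-merge pseudocode above can be turned into a runnable sketch. Python lists stand in for the hardware channels, and `out0_space` models the available space on out0; this mirrors the slide's logic but is not the Gedae implementation:

```python
from collections import deque

def flex_split(block, out0_space):
    """Split a data block between the regular path (out0) and the
    flexible path (out1), recording the routing decisions on the
    select channel so flex-merge can restore the original order."""
    n0 = min(out0_space, len(block))  # tokens that fit on out0
    out0 = block[:n0]
    out1 = block[n0:]                 # n1 = |b| - n0 overflow tokens
    select = [0] * n0 + [1] * len(out1)
    return out0, out1, select

def flex_merge(in0_tokens, in1_tokens, select):
    """Replay the select channel: a 0 pops from in0, a 1 pops from in1."""
    in0, in1 = deque(in0_tokens), deque(in1_tokens)
    return [in0.popleft() if i == 0 else in1.popleft() for i in select]
```

For example, splitting the block [1, 2, 3, 4, 5] with room for 3 tokens on out0 routes [1, 2, 3] down the regular path, [4, 5] down the flexible path, and emits select = [0, 0, 0, 1, 1]; merging with that select stream reproduces the original order, which is the semantic-preservation property the slides claim.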
Multi-Channel Flex-Split and Flex-Merge
• A flexible filter with n output channels needs one flex-merge per output channel, all driven by the same select stream
• A flexible filter with n input channels can be handled in two ways:
  – Centralized: a single flex-split feeds the original filter and its flexible copy, with one select channel and one flex-merge
  – Distributed: one flex-split per input channel, with the splits kept consistent with each other (the β channels in the figure)

Outline (next: Experiments — CFAR Case Study)

Cell BE Processor
• Distributed memory
• Heterogeneous: 8 SIMD (SPU) cores and 1 PowerPC (PPU) core
• Element Interconnect Bus: 4 rings, 205 GB/s
• Communication layer: the Gedae programming language

Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools

CFAR Benchmark
• Filter pipeline: uInt to float → square → right window / left window → align data → add → find targets
(Figure: profile of the CFAR filters on Cell — uIntToFloat, right window, add, find targets — average execution time per 100 tokens, 0–5 microseconds)

Data Dependency
• Changing the threshold changes the percentage of targets: 1.3% or 7.3%
• Additional workload per target: 16 μs, 32 μs, or 64 μs
Speedup with flexible filters:

  Additional Workload | 1.3% targets | 7.3% targets
  16 μs               | 0.82         | 1.45
  32 μs               | 1.06         | 1.39
  64 μs               | 1.27         | 1.47

More Benchmarks

  Benchmark     | Field              | Parameters                        | Speedup
  Dedup         | Information Theory | Rabin block / max chunk: 4096/512 | 2.00
  JPEG          | Image Processing   | image 128x128                     | 1.31
                |                    | image 256x256                     | 1.16
                |                    | image 512x512                     | 1.25
  Value-at-Risk | Finance            | stocks/walks/timesteps:           |
                |                    | 16/1024/1024                      | 0.98
                |                    | 64/1024/1024                      | 1.56
                |                    | 128/1024/1024                     | 1.55

Conclusions
• Flexible filters
  – adapt to data-dependent bottlenecks
  – provide distributed load balancing
  – provide speedup without modification of the original filters
  – can be implemented on top of general stream languages