Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science
Columbia University
Motivation

Stream Programming

[Figure: a simple stream graph of three filters, A → B → C, connected by channels]

• Stream programming model (see the sketch below)
  – filter: a piece of sequential code
  – channel: how filters communicate
  – token: an indivisible unit of data for a filter
• Examples: signal processing, image processing, embedded applications
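A minimal single-threaded sketch of these three concepts. The names and the round-robin scheduler are ours for illustration, not the paper's Gedae implementation: filters are sequential functions, channels are bounded FIFOs, and tokens are the integers flowing through them.

/* Minimal sketch of the filter/channel/token model with a round-robin
 * scheduler (illustrative only). */
#include <stdio.h>

#define CAP 8
typedef struct { int buf[CAP]; int head, tail, count; } channel_t;  /* FIFO of tokens */

static int  ch_full (channel_t *c) { return c->count == CAP; }
static int  ch_empty(channel_t *c) { return c->count == 0; }
static void ch_push (channel_t *c, int tok) { c->buf[c->tail] = tok; c->tail = (c->tail + 1) % CAP; c->count++; }
static int  ch_pop  (channel_t *c) { int t = c->buf[c->head]; c->head = (c->head + 1) % CAP; c->count--; return t; }

/* Filter A: source, produces tokens 1..15 */
static void filter_A(channel_t *out, int *next) {
    if (*next <= 15 && !ch_full(out)) ch_push(out, (*next)++);
}
/* Filter B: squares each token */
static void filter_B(channel_t *in, channel_t *out) {
    if (!ch_empty(in) && !ch_full(out)) { int t = ch_pop(in); ch_push(out, t * t); }
}
/* Filter C: sink, prints each token */
static void filter_C(channel_t *in) {
    if (!ch_empty(in)) printf("%d\n", ch_pop(in));
}

int main(void) {
    channel_t ab = {0}, bc = {0};
    int next = 1;
    for (int step = 0; step < 100; step++) {   /* fire each filter once per step */
        filter_A(&ab, &next);
        filter_B(&ab, &bc);
        filter_C(&bc);
    }
    return 0;
}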
Example: Constant False-Alarm Rate (CFAR) Detection

HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/

[Figure: compare the cell under test to the neighboring cells, skipping the guard cells]

with additional factors:
• number of gates
• threshold (μ)
• number of guard cells
• rows and other dimensional data
CFAR Benchmark

[Figure: filter pipeline with stages uInt to float, square, right window, left window, align data, add, and find targets, applied to the data stream below]

Data Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
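The per-cell test that these filters implement, as we read the HPEC CFAR kernel: estimate the local noise from a window of cells on each side of the cell under test, skip the guard cells, and flag a target when the cell exceeds μ times that estimate. The sketch below is 1-D only; the parameter names G, Ncfar, and mu are ours, and the real benchmark operates on a beams x dopplers x range-gates cube.

/* 1-D CFAR sketch: boundary wrap-around and the full data cube are omitted. */
#include <stdio.h>
#include <stddef.h>

static size_t cfar_1d(const float *power, size_t n,   /* squared magnitudes */
                      size_t G, size_t Ncfar, float mu,
                      size_t *targets)                 /* output: indices of detections */
{
    size_t found = 0;
    for (size_t i = G + Ncfar; i + G + Ncfar < n; i++) {
        float noise = 0.0f;
        for (size_t k = 1; k <= Ncfar; k++)            /* left and right windows */
            noise += power[i - G - k] + power[i + G + k];
        noise /= (float)(2 * Ncfar);
        if (power[i] > mu * noise)                     /* threshold test */
            targets[found++] = i;
    }
    return found;
}

int main(void) {
    float power[15] = { 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1 };
    size_t hits[15];
    size_t n = cfar_1d(power, 15, /*G=*/1, /*Ncfar=*/2, /*mu=*/4.0f, hits);
    for (size_t i = 0; i < n; i++)
        printf("target at cell %zu\n", hits[i]);
    return 0;
}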
[Animation frames: the same CFAR pipeline diagram repeated as tokens 1-15 advance through the filters, ending with the left window holding cells 1-3 and the right window holding cells 9-11, aligned around the cell under test]
Mapping

[Figure: the stream graph A → B → C mapped onto the three cores of a multi-core chip]

Throughput: rate of processing data tokens
Unbalanced Flow

[Figure: pipeline mapping of A, B, C onto cores 1-3; upstream filters must "wait" when a downstream filter becomes a bottleneck]

• Bottlenecks reduce throughput
  – Caused by backpressure (sketched below):
    • inherent algorithmic imbalances
    • data-dependent computational spikes
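Why backpressure stalls the upstream filters: channels are bounded, so a producer that finds its output queue full has nothing to do but wait, and the stall propagates filter by filter toward the source. A minimal sketch of such a blocking channel; pthreads is our assumption for illustration, while the paper's channels live in the Gedae/Cell communication layer.

/* Bounded channel with blocking push/pop (illustrative only). */
#include <pthread.h>

#define CAP 4
typedef struct {
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
} channel_t;

void ch_init(channel_t *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_full, NULL);
    pthread_cond_init(&c->not_empty, NULL);
}

void ch_push(channel_t *c, int tok) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP)                   /* backpressure: producer waits */
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = tok;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

int ch_pop(channel_t *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)                     /* starvation: consumer waits */
        pthread_cond_wait(&c->not_empty, &c->lock);
    int tok = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return tok;
}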
Data Dependent Execution Time: CFAR

• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target

[Figure: histograms of block execution time (microseconds, roughly 0-580) for runs with 0.3%, 1.3%, and 7.3% targets; y-axis: percent]
Data Dependent Execution Time

Some other examples:
• Bloom filters (financial, spam detection; see the sketch below)
• Compression (image processing)

[Figure: histogram of execution time (0.000-0.005 s); y-axis: percent]
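An illustration (not from the paper) of why a Bloom-filter stage has data-dependent execution time: a miss costs a few cheap hash probes, while a possible hit triggers a far more expensive exact check, so per-token cost spikes with the fraction of matching tokens.

/* Bloom-filter stage with a data-dependent slow path (illustrative only). */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BLOOM_BITS (1u << 16)
static uint8_t bloom[BLOOM_BITS / 8];

static uint32_t hash_k(const char *s, uint32_t seed) {  /* FNV-1a style hash */
    uint32_t h = 2166136261u ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % BLOOM_BITS;
}

static void bloom_add(const char *s) {
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t b = hash_k(s, k);
        bloom[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

static bool bloom_maybe_contains(const char *s) {       /* cheap, constant work */
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t b = hash_k(s, k);
        if (!(bloom[b / 8] & (1u << (b % 8)))) return false;
    }
    return true;                                         /* may be a false positive */
}

static bool exact_check(const char *s) {                 /* stand-in for the slow path */
    volatile uint32_t work = 0;
    for (uint32_t i = 0; i < 1000000; i++) work += i;    /* simulated extra workload */
    (void)s;
    return true;
}

int main(void) {
    bloom_add("spam@example.com");
    const char *tokens[] = { "alice@example.com", "spam@example.com", "bob@example.com" };
    for (int i = 0; i < 3; i++) {
        bool hit = bloom_maybe_contains(tokens[i]) && exact_check(tokens[i]);
        printf("%-20s -> %s\n", tokens[i], hit ? "flagged (slow path)" : "clean (fast path)");
    }
    return 0;
}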
Our Solution: Flexible Filters

[Figure: A on core 1, B on core 2, C on core 3; a flexible copy of C is placed on core 2, wrapped by flex-split and flex-merge]

• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck
• Semantic preservation
Related Work

• Static compiler optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic runtime load balancing
  – Work stealing / filter migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue-based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue-based load balancing, filter re-ordering
• Combination static + dynamic
  – FlexStream [Hormati et al., 2009]: multiple competing programs
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
– CFAR Case Study
Design Flow of a Stream Program with Flexible Filters

Design stream algorithm → Mapping (filters, memory; compiler/design tools) → Profile → Add flexibility to bottlenecks

[Figure: profile of CFAR filters on Cell: bar chart of average execution time (microseconds) per 100 tokens, 0-5, for uIntToFloat, right window, add, and find targets]
Design Considerations

[Figure: the same design flow diagram repeated]

Adding Flexibility

[Figure: the same design flow diagram repeated]
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
– CFAR Case Study
Mapping Stream Programs to Multi-Core Platforms

[Figure: three ways to map the stream graph A → B → C onto cores:
• pipeline mapping: one filter per core
• sharing a core: several filters placed on the same core
• SPMD mapping: every core runs the whole A → B → C chain on its own tokens]
Throughput: SPMD Mapping

Suppose E_A = 2, E_B = 2, E_C = 3 (execution time per token of filters A, B, and C)

[Figure: SPMD schedule over timesteps t0-t6: each of cores 1-3 runs A, B, then C on its own token (A1 B1 C1 on core 1, A2 B2 C2 on core 2, A3 B3 C3 on core 3)]

3 tokens processed in 7 timesteps, ideal throughput = 3/7 ≈ 0.429
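One way to restate the slide's arithmetic as a formula (our notation, assuming p identical cores that each run the full A → B → C chain on their own tokens):

\[
\mathrm{throughput}_{\mathrm{SPMD}} = \frac{p}{E_A + E_B + E_C} = \frac{3}{2 + 2 + 3} = \frac{3}{7} \approx 0.429
\]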
Throughput: Pipeline Mapping

[Figure: pipeline schedule over timesteps t0-t9: A (latency 2) on core 1, B (latency 2) on core 2, C (latency 3) on core 3; token i is processed as Ai, then Bi, then Ci as it moves down the pipeline]

throughput = 1/3 ≈ 0.333 < 0.429
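In the same notation, the pipeline rate is set by its slowest stage (again our restatement of the numbers on the slide):

\[
\mathrm{throughput}_{\mathrm{pipe}} = \frac{1}{\max(E_A, E_B, E_C)} = \frac{1}{3} \approx 0.333 < \frac{3}{7} \approx 0.429
\]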
Data Blocks

[Figure: the flexible pipeline again: A on core 1, B plus a flexible copy of C on core 2, C on core 3; data moves between filters in blocks]

data block: group of data tokens
Throughput: Pipeline Augmented with Flexibility

[Figure: flexible pipeline schedule over timesteps t0-t9: as in the pipeline mapping, but core 2 uses B's idle cycles to run C on some tokens, so two instances of C proceed in parallel on cores 2 and 3]

throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
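For this example, the flexible pipeline therefore sits between the plain pipeline and the SPMD ideal (our summary of the comparison shown on the slide):

\[
\frac{1}{\max_i E_i} = \frac{1}{3} \;\le\; \mathrm{throughput}_{\mathrm{flex}} = \frac{2}{5} = 0.4 \;\le\; \frac{p}{\sum_i E_i} = \frac{3}{7} \approx 0.429
\]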
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
– CFAR Case Study
Flex-Split

Flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 tokens to out0 and n1 tokens to out1
  send n0 0's, then n1 1's, to select

[Figure: flex-split takes the stream from B on channel in and routes each data block between the two copies of C (out0 and out1, one on core 2 and one on core 3); flex-merge recombines the results from in0 and in1 onto out, guided by the select channel]

✓ maintains ordering
✓ based on the run-time state of the queues
Flex-Merge

Flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out

[Figure: the same diagram, with flex-merge highlighted on the path from the two copies of C back to a single output stream]

Overhead of flexibility?
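A compact sketch of both operations in C, following the two pseudocode blocks above. The channel_t FIFO and its helpers are ours for illustration; the actual implementation sits on the Gedae/Cell communication layer, where pops block until data is available.

/* flex-split and flex-merge sketch (illustrative only). */
#include <stddef.h>

#define CAP 16
typedef struct { int buf[CAP]; size_t head, tail, count; } channel_t;

static size_t ch_space(const channel_t *c)   { return CAP - c->count; }
static int    ch_empty(const channel_t *c)   { return c->count == 0; }
static void   ch_push(channel_t *c, int tok) { c->buf[c->tail] = tok; c->tail = (c->tail + 1) % CAP; c->count++; }
static int    ch_pop(channel_t *c)           { int t = c->buf[c->head]; c->head = (c->head + 1) % CAP; c->count--; return t; }

/* flex-split: route one data block between the two copies of the downstream
 * filter. out0 is the default path; whatever does not fit there (n1 tokens)
 * spills to out1. Each decision is recorded on select so that the original
 * order can be restored. Assumes `in` holds at least block_size tokens. */
void flex_split(channel_t *in, channel_t *out0, channel_t *out1,
                channel_t *select, size_t block_size)
{
    size_t n0 = ch_space(out0);
    if (n0 > block_size) n0 = block_size;     /* n0 tokens take the default path */
    size_t n1 = block_size - n0;              /* n1 tokens go to the other copy */

    for (size_t i = 0; i < n0; i++) { ch_push(out0, ch_pop(in)); ch_push(select, 0); }
    for (size_t i = 0; i < n1; i++) { ch_push(out1, ch_pop(in)); ch_push(select, 1); }
}

/* flex-merge: replay the recorded decisions to interleave the two result
 * streams (in0 and in1, one per copy of the filter) back into the original
 * token order on out. */
void flex_merge(channel_t *in0, channel_t *in1, channel_t *select, channel_t *out)
{
    while (!ch_empty(select)) {
        int i = ch_pop(select);
        ch_push(out, i == 0 ? ch_pop(in0) : ch_pop(in1));
    }
}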
Multi-Channel Flex-Split and Flex-Merge

[Figure: a filter with multiple output channels: each output channel 1..n gets its own flex-merge, driven by the shared select channel from the flex-split around the filter and its flexible copy (filter-flex)]

[Figure: a filter with multiple input channels 1..n, handled in two ways:
• Centralized: a single flex-split in front of the filter and its flexible copy, with one select channel feeding the flex-merge
• Distributed: a flex-split on one input channel and β flex-splits on the remaining input channels, coordinated through the select channel, followed by the flex-merge]
Outline
• Introduction
• Design Flow of a Stream Program with
Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
– CFAR Case Study
Cell BE Processor

• Distributed memory
• Heterogeneous
  – 8 SIMD (SPU) cores
  – 1 PowerPC (PPU)
• Element Interconnect Bus
  – 4 rings
  – 205 GB/s
• Gedae programming language communication layer
Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools
CFAR Benchmark

[Figure: the CFAR filter pipeline again: uInt to float, square, right window, left window, align data, add, find targets]

[Figure: profile of CFAR filters on Cell: bar chart of average execution time (microseconds) per 100 tokens, 0-5, for uIntToFloat, right window, add, and find targets]
Data Dependency

• By changing the threshold, change % targets
  – 1.3 %
  – 7.3 %
• Additional workload per target
  – 16 µs
  – 32 µs
  – 64 µs

Speedup:

  Additional workload | 1.3 % targets | 7.3 % targets
  16 μs               | 0.82          | 1.45
  32 μs               | 1.06          | 1.39
  64 μs               | 1.27          | 1.47
More Benchmarks

  Benchmark     | Field              | Configuration                          | Speedup
  Dedup         | Information Theory | Rabin block / max chunk size: 4096/512 | 2.00
  JPEG          | Image Processing   | image width x height: 128x128          | 1.31
  JPEG          | Image Processing   | image width x height: 256x256          | 1.16
  JPEG          | Image Processing   | image width x height: 512x512          | 1.25
  Value-at-Risk | Finance            | stocks/walks/timesteps: 16/1024/1024   | 0.98
  Value-at-Risk | Finance            | stocks/walks/timesteps: 64/1024/1024   | 1.56
  Value-at-Risk | Finance            | stocks/walks/timesteps: 128/1024/1024  | 1.55
Conclusions

• Flexible filters
  – adapt to data-dependent bottlenecks
  – distributed load balancing
  – provide speedup without modification to the original filters
  – can be implemented on top of general stream languages