Design Automation for Streaming Systems
Eylon Caspi
University of California, Berkeley
12/2/05
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
Large System Design Challenges
♦ Devices growing with Moore’s Law
• AMD Opteron dual core CPU: ~230M transistors
• Xilinx Virtex 4 / Altera Stratix-II FPGAs: ~200K LUTs
♦ Problems of deep submicron (DSM), large systems
• Growing interconnect delay, timing closure
  ♦ "Routing delays typically account for 45% to 65% of the total path delays" (Xilinx Constraints Guide)
• Slow place-and-route
• Design complexity
• Designs do not scale well on next gen. device; must redesign
♦ Same problems in FPGAs
Limitations of RTL
♦ RTL = Register Transfer Level
♦ Fully exposed timing behavior
• always @(posedge clk) ...
♦ Laborious, error prone
♦ Unpredictable interconnect delay
• How deep to pipeline?
• Redesign on next-gen device
♦ Undermines reuse
♦ Existing solutions
• Modular design
• Floorplanning
• Physical synthesis
• Hierarchical CAD
• Latency insensitive communication
Streams
♦ A better communication abstraction
♦ Streams connect modules
[Figure: compute modules connected by streams, with a memory]
• FIFO buffered channel (queue)
• Blocking read
• Timing independent (deterministic)
♦ Robust to communication delay
• Pipeline across long distances
• Robust to unknown delay
• Post-placement pipelining
♦ Alternate transport (packet-switched NoC)
♦ Flexibly timed module interfaces
• Robust to module optimization (pipeline, reschedule, etc.)
♦ Enhances modular design + reuse
Streaming Applications
♦ Persistent compute structure (infrequent changes)
♦ Large data sets, mostly sequential access
♦ Limited feedback
♦ Implement with deep, system-level pipelining
♦ E.g. DSP, multimedia
♦ JPEG Encoder:
Ad Hoc Streaming
♦ Every module needs streaming flow control
• Block if inputs are not available or the output is not ready to receive
♦ Every stream needs queueing
• Pipeline to match interconnect delay
• Queue to absorb delay mismatch, dynamic rates
♦ Manual implementation in HDL
• Laborious (flow control, queues)
• Error prone (deadlock if protocol is violated, queue too small)
• No automation (pipeline depth, queue choice / width / depth)
♦ Interconnect / queue IP (e.g. OCP / Sonics Bus)
• Still no automation
Systematic Streaming
♦ Strong stream semantics: Process Networks
• Stream = FIFO channel with (flavor of) blocking read
• E.g. Kahn Process Networks, Dataflow Process Networks (E. A. Lee)
♦ Streams as programming primitive
• Language support hides flow control
♦ Compiler support
• Compiler generated flow control
• Compiler controlled pipelining, queue depth, queue impl.
• Compiler optimizations (e.g. module merging, partitioning)
♦ Benefits
• Easy, correct, high performance
• Portable
• Paging / virtualization is a logical extension (automatic page partitioning)
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
SCORE Model
Stream Computations Organized for Reconfigurable Execution
♦ Application = Graph of stream-connected operators
♦ Operator = Process with local state
♦ Stream = FIFO channel, unbounded capacity, blocking read
♦ Segment = Memory, accessed via streams
♦ Dynamics:
• Dynamic I/O rates
• Dynamic graph construction (omitted in this work)
[Figure: application graph of operators (SFSMs), segments, and streams]
SCORE Programming Model: TDF
♦ TDF = behavioral language for
• SFSM Operators (Streaming Extended Finite State Machine)
• Static operator graphs
♦ State machine for
• Sequencing, branching
• Firing control
♦ Firing semantics
• In state X, wait for X’s inputs, then evaluate X’s action
[Figure: operator with input streams i, j and output stream o]

    state foo (i, j):
      o = i + j;
      goto bar;
    }
SCORE / TDF Process Networks
♦ A process from M inputs to N outputs, unified stream type S (i.e. S^M → S^N)
♦ SFSM = (Σ, σ0, σ, R, fNS, fO)
• Σ = Set of states
• σ0 ∈ Σ = Initial state
• σ ∈ Σ = Present state
• R ⊆ (Σ × S^M) = Set of firing rules
• fNS : R → Σ = Next-state function
• fO : R → S^N = Output function
♦ Similar to dataflow process networks [Lee+Parks, IEEE May '95], but with stateful actors
Related Streaming Models
♦ Streaming Models
• Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC, YAPI, Catapult C, SHIM
♦ Streaming Platforms
• Pleiades, Philips VSP, Imagine, TRIPS
♦ How do we differ?
• Stateful processes
• Deterministic
• Dynamic dataflow rates (FSM nodes)
• Direct synthesis to hardware
• Bounded Buffers
Streaming Platforms
♦ FPGA (this work)
♦ Paged FPGA
• Page = fixed size partition, connected by streams
• Stylized page-to-page interconnect
• Hierarchical PAR
♦ Paged, Virtual FPGA (SCORE)
• Time shared pages
• Area abstraction (virtually large)
♦ Multiprocessor on Chip
♦ Heterogeneous
The Compilation Problem
Programming Model: TDF
• Communicating SFSMs
  - unrestricted size, # IOs, timing
• Unbounded stream buffering

Execution Model: FPGA
• Single circuit / configuration
  - one or more clocks
• Fixed size queues

[Figure: TDF operators, streams, and memory segments compiled to a single FPGA circuit]
Big semantic gap
The Semantic Gap
♦ Semantic gap between TDF, HW
♦ Need to bind:
• Stream protocol
• Stream pipelining
• Queue implementation
• Queue depths
• SFSM synthesis style (behavioral synthesis)
• Memory allocation
• Primary I/O
♦ SCORE device binds some implementation decisions (custom hardware), raw FPGA does not
♦ Want to characterize cost of implementation decisions
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
Compilation Tool Flow
Application (TDF)
  ↓  tdfc: local optimization, system optimization, queue sizing, pipeline extraction,
           SFSM partitioning / merging, pipelining, generation of flow control, streams, queues
Verilog
  ↓  Synplify: behavioral synthesis, retiming
EDIF (unplaced LUTs, etc.)
  ↓  Xilinx ISE: slice packing, place and route
Device Configuration Bits
Wire Protocol for Streams
♦ D = Data, V = Valid, B = Backpressure
♦ Synchronous transaction protocol
• Producer asserts V when D ready; consumer deasserts B when ready
• Transaction commits if (¬B ∧ V) at clock edge
• Encode EOS E as an extra D bit (out of band, easy to enqueue)
[Figure: producer and consumer connected by D (Data) / E (EOS), V (Valid), and B (Backpressure) wires, with a Clk timing diagram of a transaction]
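The commit rule is easy to state directly in RTL. Below is a minimal, hypothetical Verilog sketch of the consumer side of this protocol (module, signal, and parameter names are illustrative, not taken from the generated code); a word is captured exactly on clock edges where V is asserted and B is deasserted.

    // Consumer-side sketch of the synchronous D/V/B stream protocol.
    // EOS is carried out of band as an extra data bit, as described above.
    module stream_consumer_sketch #(parameter W = 16) (
      input            clock,
      input    [W:0]   d_e,      // W data bits plus 1 EOS bit
      input            v,        // producer: d_e is valid this cycle
      output           b,        // backpressure: hold the current word
      input            ready,    // internal "willing to accept" condition (assumed)
      output reg [W:0] captured
    );
      assign b = ~ready;               // deassert B only when ready to receive
      wire commit = v & ~b;            // transaction commits at the clock edge
      always @(posedge clock)
        if (commit)
          captured <= d_e;             // accept data (and EOS flag) on commit
    endmodule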
Operator Firing
♦ In state X, fire if
• Inputs desired by X are ready (Valid, EOS)
• Outputs emitted by X are ready (Backpressure)
♦ Firing guard / control flow
• if (iv && !ie && !ob) begin
      ib = 0; ov = 1;
      ...
    end
[Figure: operator "Op" with input stream signals i_d,e / i_v / i_b and output stream signals o_d,e / o_v / o_b]
♦ Subtlety: master, slave
• Operator is a slave: to synchronize streams, (1) wait for flow control in, (2) fire / emit out
• Connecting two slaves would deadlock
• Need a master (queue) between every pair of operators
SFSM Synthesis
[Figure: SFSM implementation split into an FSM (control) and a datapath (data registers plus a per-state datapath for State 1, State 2, ...), sharing the stream I/O signals B, V, E, D]
♦ Implemented as behavioral Verilog, using a state 'case' in FSM and DP
♦ FSM handles firing control, branching
♦ FSM sends state to DP
♦ DP sends boolean flags to FSM
FSM Module, Firing Control

TDF:

    foo (input  unsigned[16] x,
         input  unsigned[16] y,
         output unsigned[16] o)
    {
      state one (x, eos(y)) : o = x+1;
      ...
    }

Verilog FSM module:

    module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b,
                    o_e, o_v, o_b, state, statecase);
      ...
      always @* begin
        // Default is stall
        x_b_ = 1; y_b_ = 1; o_e_ = 0; o_v_ = 0;
        state_reg_ = state_reg;
        statecase_ = statecase_stall;
        did_goto_ = 0;
        case (state_reg)
          state_one: begin
            // Firing condition(s) for state one
            if (x_v && !x_e && y_v && y_e && !o_b) begin
              statecase_ = statecase_1;
              // Stream flow ctl for state one
              x_b_ = 0; y_b_ = 0; o_v_ = 1; o_e_ = 0;
            end
          end
          ...
        endcase
      end // always @*
    endmodule // foo_fsm
Data-Path Module

TDF:

    foo (input  unsigned[16] x,
         input  unsigned[16] y,
         output unsigned[16] o)
    {
      state one (x, eos(y)) : o = x+1;
      ...
    }

Verilog data-path module:

    module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
      ...
      always @* begin
        // Default is stall
        o_d_ = 16'bx;
        did_goto_ = 0;
        case (state)
          state_one: begin
            // Firing condition(s) for state one
            if (statecase_ == statecase_1)
              // Data-path for state one
              o_d_ = (x_d + 1'd1);
          end
          ...
        endcase
      end // always @*
    endmodule // foo_dp
Stream Buffers (Queues)
♦ Systolic
• Cascade of depth-1 stages (or depth-N)
♦ Shift register
• Put: shift all entries
• Get: tail pointer
♦ Circular buffer
• Memory with head / tail pointers
Enabled Register Queue
♦ Systolic, depth-1 stage
♦ 1 state bit (empty/full) = V
♦ Shift in data unless:
• Full and downstream not ready to consume queued element
♦ Area ≈ 1 FF per data bit
• On FPGA ≈ 1 LUT cell per data bit
• Depth-1 (single stage) nearly free, since FFs pack with logic
♦ Speed: as fast as FF
• But combinationally connects producer + consumer via B
[Figure: enabled register stage with input signals iD, iV, iB, output signals oD, oV, oB, and enable "en"]
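As a concrete illustration, here is a small, hypothetical Verilog sketch of such a depth-1 enabled-register stage (signal names iD/iV/iB and oD/oV/oB follow the figure; the module name, parameterization, and reset behavior are my own assumptions). Note the combinational iB path, which is exactly the producer-to-consumer B coupling mentioned in the last bullet.

    // Depth-1 enabled-register queue stage: one data register, one state bit.
    module enreg_queue_sketch #(parameter W = 17)   // W includes the EOS bit
    ( input          clock, reset,
      input  [W-1:0] iD,
      input          iV,
      output         iB,
      output [W-1:0] oD,
      output         oV,
      input          oB );

      reg [W-1:0] data;
      reg         full;                  // single state bit; doubles as oV
      wire        en = ~(full & oB);     // shift in unless full and blocked

      always @(posedge clock) begin
        if (reset)       full <= 1'b0;
        else if (en) begin
          data <= iD;                    // capture (or overwrite a word consumed this cycle)
          full <= iV;                    // stage holds a word iff one was offered
        end
      end

      assign oD = data;
      assign oV = full;
      assign iB = full & oB;             // combinational B path, producer to consumer
    endmodule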
Xilinx SRL16
♦ SRL16 = shift register of depth 16 in one 4-LUT cell
• Shift register of arbitrary width: parallel SRL16s; arbitrary depth: cascade SRL16s
♦ Improve queue density by 16x
[Figure: Virtex LUT cell in 4-LUT mode vs. SRL16 mode]
Shift Register Queue
♦ State: empty bit + capacity counter
♦ Data stored in shift register
• In at position 0
• Out at position Address
♦ Address = number of stored elements minus 1
♦ Synplify infers SRL16E from Verilog array
• Parameterized depth, width
♦ Flow control
• ov = (State == NonEmpty)
• ib = (Address == Depth-1), i.e. full
♦ Performance improvements
• Registered data out
• Registered flow control
• Specialized, pre-computed fullness
[Figure: FSM (Empty / NonEmpty) and an Address counter (+1 / -1, compared to Depth-1 for "full" and to 0 for "zero") controlling an enabled shift register; stream signals iD,E / iV / iB in and oD,E / oV / oB out]
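Below is a hypothetical Verilog sketch of this queue organization (module and signal names, reset handling, and the exact inference template are my own assumptions; tools generally infer SRL16E cells from a shift-register array read at a run-time index, which is what the mem array and addr read provide). The flow-control expressions follow the bullets above.

    // Shift-register (SRL16-style) stream queue sketch.
    // DEPTH should be a power of two; A = number of address bits.
    module srl_queue_sketch #(parameter W = 17, parameter DEPTH = 16, parameter A = 4)
    ( input          clock, reset,
      input  [W-1:0] iD,  input  iV,  output iB,
      output [W-1:0] oD,  output oV,  input  oB );

      reg [W-1:0] mem [0:DEPTH-1];          // shift-register storage (maps to SRL16E cells)
      reg [A-1:0] addr;                     // number of stored elements minus 1
      reg         empty;
      integer     k;

      wire push = iV & ~iB;                 // producer-side transaction commits
      wire pop  = oV & ~oB;                 // consumer-side transaction commits

      always @(posedge clock) begin
        if (push) begin                     // in at position 0
          for (k = DEPTH-1; k > 0; k = k-1) mem[k] <= mem[k-1];
          mem[0] <= iD;
        end
        if (reset) begin
          empty <= 1'b1;
          addr  <= {A{1'b0}};
        end else case ({push, pop})
          2'b10:   begin empty <= 1'b0; if (!empty) addr <= addr + 1'b1; end
          2'b01:   begin if (addr == 0) empty <= 1'b1; else addr <= addr - 1'b1; end
          default: ;                        // push and pop together, or neither: count unchanged
        endcase
      end

      assign oD = mem[addr];                // out at position Address
      assign oV = ~empty;                   // State == NonEmpty
      assign iB = ~empty & (addr == DEPTH-1);   // full
    endmodule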
SRL Queue with Registered Data Out
♦ Registered data out
• od (clock-to-Q delay)
• Non-retimable
♦ Data output register extends shift register
♦ Bypass shift register when queue empty
♦ 3 states (Empty, One, More)
♦ Address = number of stored elements minus 2
♦ Flow control
• ov = !(State == Empty)
• ib = (Address == Depth-2)
[Figure: same queue with a Data Out register appended after the shift register; Address is compared to Depth-2 for "full" and to 0 for "zero"]
SRL Queue with Registered Flow Ctl.
♦ Registered flow ctl.
• ov (clock-to-Q delay)
• ib (clock-to-Q delay)
• Non-retimable
♦ Flow control
• ov_next = !(State_next == Empty)
• ib_next = (Address_next == Depth-2)
♦ Based on precomputed fullness
• full_next = (Address_next == Depth-2)
[Figure: same queue, with ov and ib registered from next-state / next-address values]
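The registered / precomputed flow control can be sketched generically as below (a hypothetical fragment of my own: occupancy is tracked as a plain counter rather than the Empty/One/More states and Address of the actual queue, and the fullness threshold is not special-cased per state as described on the next slide). The point is that ov and ib become clock-to-Q outputs computed from next-cycle occupancy.

    // Sketch: register the stream flow-control outputs off the *next* occupancy.
    module flowctl_precompute_sketch #(parameter DEPTH = 16)
    ( input      clock, reset,
      input      push,                    // a word is written this cycle
      input      pop,                     // a word is read this cycle
      output reg ov,                      // registered "non-empty"
      output reg ib );                    // registered "full"

      reg  [$clog2(DEPTH):0] count;
      wire [$clog2(DEPTH):0] count_next = count + push - pop;

      always @(posedge clock) begin
        if (reset) begin
          count <= 0;  ov <= 1'b0;  ib <= 1'b0;
        end else begin
          count <= count_next;
          ov <= (count_next != 0);        // precomputed non-empty
          ib <= (count_next == DEPTH);    // precomputed fullness
        end
      end
    endmodule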
SRL Queue with Specialized, Pre-Computed Fullness
♦ Speed up critical full pre-computation by special-casing full_next for each state
♦ Flow control
• ov_next = !(State_next == Empty)
• ib_next = full_next
♦ zero pre-computation is less critical
♦ Result
• >200 MHz unless very large (e.g. 128 x 128)
• All output delays are clock-to-Q
• Area ≈ 3 x (SRL16E area)
[Figure: same queue; the "full" comparison is now against Depth-3]
SRL Queue Speed
SRL Queue Area
Page Synthesis
♦ Page = Cluster of operator(s) + queues
♦ SFSMs
• One or more per page
• Further decomposed into FSM, data-path
♦ Page Input Queues
• Deep
• Drain pipelined page-to-page streams before reconfiguration
♦ In-page Queues
• Shallow
♦ Separately Synthesizable Modules
• Separate characterization
• Consider custom resources
[Figure: a page containing page input queue(s), Op 1 and Op 2 (each split into FSM + datapath), and an in-page queue]
Page Synthesis
♦ Module Hierarchy
• Individual SFSMs (combinational cores)
• Local / output queues
• Operators and local / output queues
• Input queues
• Page
[Figure: the same page diagram, annotated with the levels of the module hierarchy]
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
Tool Flow, Revisited
♦ Separate compilation for application, SFSMs
• Page
• SFSM
• Datapath
• FSM

Tool flow: Application → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → Device Configuration Bits

Tool Options
• Identical queuing for every stream (SRL16 based, depth 16)
• I/O boundary regs (for Xilinx static timing analysis)
• Synplify 8.0
• Target 200 MHz
• Optimize: FSM, retiming, pipelining
• Retain monolithic FSM encodings
• ISE 6.3i
• Constrain to minimum square area, at least max slice packing + 20%, expand if fail PAR
• Device: XC2VP70 -7
PAR Flow for Minimum Area
♦ EDIF netlist from Synplify
♦ Constraints file
• Page area
• Target period
♦ ngdbuild:
• Convert netlist EDIF → NGD
♦ map:
• Pack LUTs, MUXes, etc. into slices
♦ trce (pre-PAR):
• Static timing analysis, logic only
♦ par:
• Place and route
♦ trce (post-PAR):
• Static timing analysis
[Flow chart: EDIF + constraints → ngdbuild → map → trce (pre-PAR) → par → trce (post-PAR); when a check fails, retry with target packed timing, target packed slices, or a target area grown by one extra row/column]
SCORE Applications
♦ 7 Multimedia Applications / 279 Operators
• MPEG, JPEG, Wavelet, IIR
• Written by Joe Yeh
• Mostly feed-forward streaming
• Constant consumption / production ratios, except compressors (ZLE, Huffman)
Application        SFSMs  Segments  Streams            Speed   Area             %Area   %Area
                                     In  Local  Out    (MHz)   (4-LUT cells)    FSMs    Queues
IIR                    8         0    1      7    1      166       1,922         3.4%   27.7%
JPEG Decode            9         1    1     41    8       47       7,442         7.0%   28.7%
JPEG Encode           11         4    8     42    1       57       6,728         7.5%   36.9%
MPEG Encode IP        80        16    6    231    1       47      41,472         5.5%   39.7%
MPEG Encode IPB      114        17    3    313    1       50      65,772         5.2%   40.5%
Wavelet Encode        30         6    1     50    7      106       8,320        10.1%   32.0%
Wavelet Decode        27         6    7     49    1      109       8,712         8.5%   29.6%
Total                279        50   27    733   20              140,368         5.9%   38.3%
Page Area
[Chart: area per page; the largest pages are DCT, IDCT]
♦ 87% of SFSMs are smaller than 512 LUTs (by design)
♦ FSMs small
♦ Datapaths dominate in most large pages
Page Speed
[Chart: speed per page and critical component; 43% / 47% / 10% breakdown]
♦ FSMs (flow control) are fast, never critical
♦ Queues are critical for 1/3 of the fastest pages
♦ Datapaths dominate
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
Improving Performance, Area
♦ Local (module) Optimization
• Traditional compiler optimization (const folding, CSE, etc.)
• Datapath pipelining / loop scheduling
• Granularity transforms (composition / decomposition)
♦ System Level Optimization
• Interconnect pipelining
• Shrink / remove queues
• Area-time transformations (rate matching, serialization, parallelization)
Pipelining With Streams
♦ Datapath pipelining
• Add registers at output (or input)
• Retime into place
♦ Harder in practice (FSM, cycles)
• Add registers at strategic locations
• Rewrite control
• Avoid violating communication protocol
♦ Stream pipelining
• Add registers on streams
• Retime into datapath
• Modify queues, not processes
[Figure: a chain of modules (DP, FSM+DP, FSM+DP); registers added on the streams between them retime into the datapaths]
Logic Pipelining
♦ Add L pipeline registers to D, V
♦ Retime backwards
• This pipelines feed-forward parts of producer’s data-path
♦ Stale flow control may overflow queue (by L)
♦ Modify queue to emit back-pressure when empty slots ≤ L
♦ No manual modification of processes
[Figure: producer → L pipeline registers on D (Data) and V (Valid) → queue with L reserve → consumer; B (Backpressure) is not pipelined, and the added registers retime backwards into the producer]
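A hypothetical Verilog sketch of the required queue change follows (names and the circular-buffer storage are my own; in the SRL-based queues above the analogous change is to the iB expression only). The essential points: backpressure is asserted early, as soon as free slots drop to L, while the queue keeps accepting the up-to-L words already in flight behind the pipelined D/V wires.

    // Queue with an L-entry reserve: assert iB when empty slots <= L, but
    // keep accepting arriving words until actually full.
    module reserve_queue_sketch #(parameter W = 17, parameter DEPTH = 16, parameter L = 2)
    ( input              clock, reset,
      input      [W-1:0] iD,  input iV,  output iB,
      output reg [W-1:0] oD,  output oV, input  oB );

      reg [W-1:0]             mem [0:DEPTH-1];
      reg [$clog2(DEPTH):0]   count;            // current occupancy
      reg [$clog2(DEPTH)-1:0] wr, rd;

      wire push = iV & (count != DEPTH);        // accept into reserve slots despite iB
      wire pop  = oV & ~oB;

      always @(posedge clock) begin
        if (reset) begin
          count <= 0;  wr <= 0;  rd <= 0;
        end else begin
          if (push) begin mem[wr] <= iD; wr <= wr + 1'b1; end
          if (pop)  rd <= rd + 1'b1;
          count <= count + push - pop;
        end
      end

      always @* oD = mem[rd];

      assign oV = (count != 0);
      assign iB = (DEPTH - count) <= L;         // stale-by-L flow control cannot overflow
    endmodule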
Logic Relaying + Retiming
♦ Break-up deep logic in a process
♦ Relay through enabled register queue(s)
♦ Retime registers into adjacent process
• This pipelines feed-forward parts of process’s datapath
• Can retime into producer or consumer
♦ No manual modification of processes
[Figure: an enabled-register queue inserted on the stream between the producer and the original queue; its registers retime into the adjacent process]
Benefits, Limitations
♦ Benefits
• Simple to implement, relies only on retiming
• Sufficient for many cases, e.g. DCT, IDCT
♦ Limitations
• Feed-forward only (weaker than loop sched.)
• Resource sharing obfuscates retiming opportunities
♦ Extends to interconnect pipelining
• Do not retime into logic — register placement only
• Also pipeline B, modify queue
Pipelining Configuration
[Figure: per-stream pipelining configuration around an SFSM: Li enabled registers for input-side logic relaying, Lp D/V pipeline registers plus a queue with reserve for logic pipelining, and Lr enabled registers for output-side logic relaying]
♦ Pipeline depth parameters: Li+Lp+Lr
♦ Uniform pipelining: same depths for every stream
Speedup from Logic Pipelining
[Chart: speedup vs. number of added registers, for enabled registers (Lr) and D FFs (Lp)]
Expansion from Logic Pipelining
[Chart: area expansion vs. number of added registers, for enabled registers (Lr) and D FFs (Lp)]
Some Things Are Better Left Unpipelined
♦ Page speedup (chart)
♦ Page expansion (chart)
♦ Initially fast pages should not be pipelined
Page Specific Logic Pipelining
♦ Separate pipelining of each SFSM
♦ Assumption: application speed = slowest page speed (the "critical page")
♦ Repeatedly improve the slowest page until no further improvement is possible
♦ Page improvement heuristics
• Greedy Lr: add one level of pipelining in 0+0+Lr
• Greedy Lp: add one level of pipelining in 1+Lp+0
• Max: pipeline to best page speed (brute force)
♦ Greedy heuristics may end early
• Non-monotonicity: adding a level of pipelining may slow a page
Speedup from Page-Specific Pipelining
[Chart: speedup per application, for enabled registers (Lr) and D FFs (Lp)]
Expansion from Page-Specific Pipelining
[Chart: area expansion per application, for enabled registers (Lr) and D FFs (Lp)]
Interconnect Delay
♦ Critical routing delay grows with circuit size
• Routing delay for an application: avg. 45% - 56%
• Routing delay for its slowest page: avg. 40% - 50%
• Ratio (appl. to slowest page): avg. 0.99x - 1.34x
• Averaged over 7 apps / varies with logic pipelining
♦ Modular design helps
• Retain critical routing delay of page, not application
• Page-to-page delays (streams) can be pipelined
Interconnect Pipelining
♦ Add W pipeline registers to D, V, B
• Mobile registers for placer
• Not retimable
♦ Stale flow control may overflow queue (by 2W)
• Staleness = total delay on B-V feedback loop = 2W
♦ Modify downstream queue to emit back-pressure when empty slots ≤ 2W
[Figure: W pipeline registers on each of D (Data), V (Valid), and B (Backpressure) across the long distance between the producer and a queue with 2W reserve, which feeds the consumer]
Speedup from Interconnect Pipelining
Speedup from Interconnect Pipelining, No Area Constraint
Expansion from Interconnect Pipelining, No Area Constraint
Interconnect Register Allocation
♦ Commercial FPGAs / tool flows
• No dedicated interconnect registers
• Allocation: add to netlist, slice pack, place-and-route
• If registers pack with logic → limited register mobility
• If registers pack alone → area overhead
♦ Better: post-placement register allocation
• Weaver et al., "Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA," FPGA 2003
• Allocation: PAR, c-slow, retime, scavenge registers, reroute
• No area overhead (scavenge registers from existing placement)
• Better performance, since routing delay is known
• Modification for streaming: PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)
Throughput Modeling
♦ Pipelining feedback loops may reduce throughput (tokens per clock period)
• Which loops / streams are critical?
♦ Throughput model for PN
• Feedback cycle C with M tokens, N pipe delays, has token period: TC = M/N
• Overall token period: T = maxC {TC}
• Available slack: CycleSlackC = (T - TC)
• Generalize to multi-rate, dynamic rate by unfolding to an equivalent single-rate PN
[Figure: example process network with two feedback cycles, TC1 = 3 and TC2 = 2]
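As a small worked instance of the slack computation, using the two cycle periods labeled in the figure (the cycles' token and delay counts themselves are not reproduced here):

    T = \max_C T_C = \max(T_{C1}, T_{C2}) = \max(3, 2) = 3,
    \qquad
    \mathrm{CycleSlack}_{C2} = T - T_{C2} = 3 - 2 = 1,
    \qquad
    \mathrm{CycleSlack}_{C1} = 0 .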
Throughput Aware Optimizations
♦ Throughput aware placement
• Adapt [Singh+Brown, FPGA 2002]
• Stream slack: Te = max{C s.t. e∈C} {TC}
• Stream net criticality: crit = 1 - ((T - Te) / T)
♦ Throughput aware pipelining
• Pipeline stream w/o exceeding slack
• Pipeline module s.t. depth does not exceed any output stream slack
♦ Pipeline balancing (by retiming)
♦ Process Serialization
• Serial arithmetic for process with low throughput, high slack
Stream Buffer Sizing
♦ Fixed size buffers in hardware
• For minimum area (want smallest feasible queue)
• For performance (want deep enough to avoid stalls from producer-consumer timing mismatch)
♦ Semantic gap
• Buffers are unbounded in TDF, bounded in HW
• Small buffer may create artificial deadlock (bufferlock)
• Theorem: memory bound is undecidable for a Turing complete process network
• In practice, our buffering requirements are small
[Figure: two two-process examples (writes x=, y=; reads =x, =y), one with bounded and one with unbounded buffer requirements]
Dealing with Undecidability
♦ Handle unbounded streams
• Buffer expansion [Parks '95]
  - Detect bufferlock, expand buffers
• Hardware implementation
  - Buffer expansion = rewire to another queue
  - Storage in off-chip memory or queue bank
♦ Guarantee depth bound for some cases
• User depth annotation
• Analysis
  - Identify compatible SFSMs with balanced schedules
♦ Detect bufferlock and fail
Interface Automata
[de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001]
♦ A finite state machine that transitions on I/O actions
• Not input-enabled (not every I/O on every cycle)
♦ G = (V, E, Ai, Ao, Ah, Vstart)
• Ai = input actions      x?   (in CSP notation)
• Ao = output actions     y!   "
• Ah = internal actions   z;   "
• E ⊂ V × (Ai ∪ Ao ∪ Ah) × V   (transition on action)
♦ Execution trace = (v, a, v, a, …)
[Figure: example automaton for a "select" operator with input actions s?, t?, f?, internal actions st;, sf;, output action o!, and non-deterministic branching]
Automata Composition
♦ Composition ~ product FSM with synchronization (rendezvous) on common actions
♦ Composition edges:
(i) step A on an unshared action
(ii) step B on an unshared action
(iii) step both on a shared action
♦ Compatible composition → bounded memory
[Figure: automata A (actions x?, y!) and B (actions y?, z!) connected by stream y; the direct product vs. the automata composition, in which the shared action y synchronizes (y;)]
Stream Buffer Bounds Analysis
♦ Given a process network, find minimum buffer sizes to avoid bufferlock
♦ Buffer (queue) is also an automaton
♦ Symbolic Parks' algorithm
• Compose network using arbitrary buffer sizes
• If deadlock, try larger sizes
♦ Practical considerations: avoiding state explosion
• Multi-action automata
• Know which streams to expand first
• Compose pairwise in clever order (composition is associative)
• Cull states reaching deadlock
• Partition system
[Figure: processes A and B in a network; a stream is modeled by a depth-2 queue automaton Q (states 0, 1, 2; transitions i?, o!, i?o!)]
SFSM Decomposition (Partitioning)
♦ Why decompose
• To improve locality
• To fit into custom page resources
♦ Decomposition by state clustering
• 1 state (i.e. 1 cluster) active at a time
♦ Cluster states to contain transitions
• Fast local transitions, slow external transitions
• Formulation: minimize cut of transition probability under area, I/O constraints
♦ Similar to:
• VLIW trace scheduling [Fisher '81]
• FSM decomp. for low power [Benini/DeMicheli ISCAS '98]
• GarpCC HW/SW partitioning [Callahan '00]
• VM/cache code placement
[Figure: SFSM state graph clustered into partitions; state-flow edges vs. data-flow edges]
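One way to write that clustering objective down (my own formalization of the bullet above, not necessarily the thesis's exact formulation): choose an assignment π of states to clusters P_1..P_k that minimizes the total transition probability crossing clusters, subject to per-cluster area and I/O limits.

    \min_{\pi}\ \sum_{(s_i \to s_j):\ \pi(s_i) \ne \pi(s_j)} p(s_i \to s_j)
    \quad \text{subject to} \quad
    \mathrm{area}(P_k) \le A_{\max},\qquad \mathrm{IO}(P_k) \le IO_{\max} \quad \forall k .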
Early SFSM Decomposition Results
♦ Approach 1: Balanced, multi-way, min-cut
• Modified Wong FBB [Yang+Wong, ACM ‘94]
• Edge weight is mix: c*(transition probability) + (1-c)*(wire bits)
• Poor at simultaneous I/O constraint + cut optimization
♦ Approach 2: Spectral order + Extent cover
• Spectral ordering clusters connected components in 1D
  - Minimize squared weighted distance, weight is mix (as above)
• Then choose area + I/O feasible extents [start, end] using dynamic programming
• Effective for partitioning to custom page resources
♦ Under 2% external transitions
• Amdahl’s law: few slow transitions ⇒ small performance loss
• Achievable with either approach
Summary
♦ Streaming addresses large system design challenges
• Growing interconnect delay
• Flexibly timed module interfaces
• Design complexity
• Reuse
♦ Methodology to compile SCORE applications to Virtex-II Pro
• Language + compiler support for streaming
♦ Characterized 7 applications on Virtex-II Pro
• Queue area ~38%
• Flow control FSM area ~6%
• Improve by merging SFSMs, eliminating queues
♦ Stream pipelining
• For logic
• For interconnect
♦ Stream based optimizations
• Pipelining • Queue sizing
• Placement • Serialization
• Module Merging • Partitioning
Supplemental Material
TDF ∈ Dataflow Process Networks
♦ Dataflow Process Networks [Lee+Parks, IEEE May '95]
• Process enabled by a set of firing rules: R = { r1, r2, …, rK }
• Firing rule = set of input patterns: ri = ( ri,1, ri,2, …, ri,M )
♦ DF process for a TDF operator:
• Feedback arc for state
• Firing rule(s) per state
  - Patterns match state + input presence
  - E.g. for state σ: rσ = ( [σ], rσ,1, rσ,2, … )
  - Patterns: rσ,j = [*] if input j is in the input signature of state σ
              rσ,j = ⊥  if input j is not in the input signature of state σ
• Single firing rule per state = DFPN sequential firing rules
• Multiple firing rules per state translate the same way, with restrictions to retain determinism
[Figure: DF process with a feedback arc carrying the state]
SFSM Partitioning Transform
♦ Only 1 partition active at a time
• Transform to activate via streams
♦ New state in each partition: "wait"
• Used when not active
• Waits for activation from other partition(s)
• Has one input signature (firing rule) per activator
♦ Firing rules are not sequential, but determinism guaranteed
• Only 1 possible activator
♦ Activation streams from a given source partition to a given destination partition can be merged + binary-encoded
[Figure: SFSM with states A, B, C, D split into partitions {A,B} and {C,D}; each partition gains a Wait state, and activation streams connect the partitions]
Virtual Paged Hardware (SCORE)
♦ Compute model has unbounded resources
• Programmer does not target a particular device size
♦ Paging
• "Compute pages" swapped in/out (like virtual memory)
• Page context = thread (FSM to block on stream access)
♦ Efficient virtualization
• Amortize reconfiguration cost over an entire input buffer
• Requires "working sets" of tightly-communicating pages to fit on device
[Figure: streaming pipeline (Transform, Quantize, RLE, Encode) mapped onto compute pages with buffers between them]