Placer LUT Merging

Architecture-Specific Packing for Virtex-5 FPGAs
Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal
February 25th, 2008
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary
Simplified FPGA Logic Element
[Figure: a 4-LUT with inputs A1 to A4 and output O4, paired with a flip-flop (FF)]
Simplified FPGA Logic Block
[Figure: a logic block of four 4-LUT/FF pairs connected to the general interconnect]
Virtex-5 Logic Block
[Figure: a CLB containing two SLICEs, each with four 6-LUT/FF pairs, connected to the general interconnect]
Dual-Output 6-LUT
[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5]
Dual-Output 6-LUT Usage
[Figure: the dual-output 6-LUT used as two 5-LUTs, with inputs A1 to A6 and outputs O6 and O5]
Dual-Output Packing
[Figure: two 6-LUTs implementing logic functions X and Y are packed into the two 5-LUTs of a single dual-output 6-LUT, with A6 tied to VCC; Y drives O6 and X drives O5]
Number of 6-LUTs used: 1!
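The packing shown on this slide can be illustrated with a small feasibility check. This is a minimal sketch, assuming the two 5-LUT halves share the inputs A1 to A5 while A6 is tied to VCC, so two functions fit only if together they use at most five distinct signals. The function name `can_share_dual_output_lut` and the signal names are hypothetical and do not come from the slides.

```python
def can_share_dual_output_lut(inputs_x, inputs_y, max_shared_inputs=5):
    """Return True if two LUT functions could share one dual-output 6-LUT.

    Assumption: both 5-LUT halves see the same input pins (A1..A5), so the
    union of the two functions' input signals must fit on five pins.
    """
    distinct_signals = set(inputs_x) | set(inputs_y)
    return len(distinct_signals) <= max_shared_inputs


# Two 2-input functions with disjoint inputs use four pins in total.
small_pair = can_share_dual_output_lut({"x", "y"}, {"a", "b"})

# Two 3-input functions with disjoint inputs would need six pins.
large_pair = can_share_dual_output_lut({"x", "y", "z"}, {"a", "b", "c"})
```

A check like this only says whether a merge is legal; whether it is worthwhile is a separate question that depends on timing.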
Virtex-5 LUT/FF Pair
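[Figure: a Virtex-5 LUT/FF pair: a dual-output 6-LUT (O6, O5), carry logic (CIN, CY, XOR), the F7 mux, the AX input, the A and AMUX outputs, and a flip-flop (FF) driving AQ]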
Dual-Output Packing Tradeoff
[Figure: a dual-output 6-LUT in a LUT/FF pair, showing the O6 and O5 outputs, F7, AX, and the FF]
Dual-Output Packing in Placer
• Goal: reduce area without a performance hit
– Packing can be done pre-placement
• But it will be sub-optimal without delay estimates
– Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
– Allow the second 5-LUT to be used when the performance impact is small
– Incorporate LUT packing in the placer’s cost function
Placer Cost Function
• Previous cost function:
– Cost = a * W + b * T
– W: wirelength cost; T: timing performance cost
• Extend the cost function with two new terms:
– One based on 6-LUT utilization (L)
– One based on SLICE utilization (S)
– Cost = a * W + b * T + c * L + d * S
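The extended cost function can be written out directly. The sketch below is illustrative only: the weight values, the `CostWeights` container, and the function name `placer_cost` are assumptions, and the four terms are taken as already-computed numbers because the slides do not give the placer's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class CostWeights:
    """Weights a, b, c, d from the slide (values here are hypothetical)."""
    a: float
    b: float
    c: float
    d: float


def placer_cost(wirelength, timing, lut_util, slice_util, weights):
    """Cost = a * W + b * T + c * L + d * S."""
    return (weights.a * wirelength
            + weights.b * timing
            + weights.c * lut_util
            + weights.d * slice_util)


# With c = d = 0 the expression reduces to the previous cost a * W + b * T.
previous = placer_cost(10.0, 2.0, 3.0, 4.0, CostWeights(1.0, 2.0, 0.0, 0.0))
extended = placer_cost(10.0, 2.0, 3.0, 4.0, CostWeights(1.0, 2.0, 0.5, 0.25))
```

Setting c and d to zero recovers the original placer behavior, which makes the two new terms easy to tune or disable independently.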
6-LUT Utilization Term
• L is computed over all the used 6-LUT slots
[Equation: definition of L and of its terms]
SLICE Utilization Term
• S is computed over all the available SLICEs:
$S = \sum_{i=0}^{m} S_i$
• Let:
– N_i = number of used 5-LUTs in SLICE i (at most 8)
Performance Recovery
• It is helpful to prohibit packing in certain cases for performance reasons
• Other used elements in a SLICE may block the “good” path from the O5 output to the external interconnect
Performance Recovery: XOR
[Figure: the Virtex-5 LUT/FF pair (LUT6 with O6 and O5, CY, XOR, F7, AX, CIN, A, AMUX, FF, AQ), illustrating the XOR case]
Performance Recovery: F7
[Figure: the Virtex-5 LUT/FF pair (LUT6 with O6 and O5, CY, XOR, F7, AX, CIN, A, AMUX, FF, AQ), illustrating the F7 case]
6-LUT Reduction
[Chart: % 6-LUT reduction (y-axis) per benchmark design (x-axis), annotated with a 5.5% 6-LUT reduction]
SLICE Reduction
[Chart: % SLICE reduction (y-axis) per benchmark design (x-axis), annotated with a 10.23% SLICE reduction]
Performance Results
[Chart: performance loss (%) on the y-axis versus SLICE reduction (%) on the x-axis, annotated with a 3.3% performance degradation]
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary
New Type of Packing Problem
• Traditionally, packing is considered to be a problem of just LUTs and flops
• However, Virtex-5 contains large IP blocks that present their own packing problem
Virtex-5 Block RAMs
• A 36 Kb block RAM tile can hold:
a) a single 36 Kb RAM
b) two independent 18 Kb RAMs
• Block RAM has a configurable “aspect ratio”
• An 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18
• Tools decide which independent 18 Kb block RAMs to locate in which tile
[Figure: a block RAM tile shown as one 36 Kb RAM and as two 18 Kb RAMs]
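As a quick check on the aspect ratios listed above, the sketch below multiplies depth by width for each configuration. The names `CONFIGS_18KB` and `capacity_bits` are hypothetical; only the four depth/width pairs come from the slide.

```python
# (depth, width) pairs for the 18 Kb RAM configurations listed on the slide.
CONFIGS_18KB = [(16 * 1024, 1), (8 * 1024, 2), (2 * 1024, 9), (1 * 1024, 18)]

BLOCK_BITS = 18 * 1024  # 18 Kb


def capacity_bits(depth, width):
    """Number of bits addressed by a depth x width configuration."""
    return depth * width


capacities = [capacity_bits(depth, width) for depth, width in CONFIGS_18KB]
all_fit = all(bits <= BLOCK_BITS for bits in capacities)
```

By this arithmetic the x9 and x18 configurations address the full 18 Kb, while the x1 and x2 configurations address 16 Kb.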
Virtex-5 DSP48E Block
• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E
• Multiple DSP48Es can be chained together through the PCIN and PCOUT ports to form more complex functions
[Figure: DSP48E block diagram: inputs A (25-bit), B (18-bit), and C (48-bit) with optional pipeline registers/routing logic, a 25x18 multiplier, an ALU with pattern detect, the 48-bit output P, and the PCIN and PCOUT cascade ports]
Block RAM and DSP Floorplan
• Block RAM and DSP48E tiles are organized in columns
[Figure: floorplan showing columns of block RAM tiles alongside a column of DSP48E blocks grouped into Virtex-5 DSP tiles]
Block RAM/DSP Packing
• Problem: placer algorithms are heuristic and sometimes do not find an optimal block RAM packing
• Goal: leverage preferred block RAM packing patterns to achieve high performance
• Target area: DSP designs
– DSP designs make heavy use of block RAMs and DSP blocks
DSP Block RAM Designs
• The most common DSP application is the finite impulse response (FIR) filter
– FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs
FIR Filter
• A finite impulse response (FIR) filter is a digital filter that takes a weighted sum of the signals in a delay line
• An N-tap filter can be expressed as:
$y[n] = c_0 x[n] + c_1 x[n-1] + \dots + c_{N-1} x[n-N+1]$
– Where:
• y[n] is the output of the filter at time n
• x[n] is the data input “signal” at time n
• c_i is the coefficient of tap i
• Each coefficient/data product in the sum is referred to as a “tap”
– DSP units are used for the multiply and accumulate
– Block RAMs are used to store the data and coefficients
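The N-tap expression above can be evaluated directly in software. This is a minimal reference sketch, not a model of the hardware: it assumes samples before time 0 are zero, and the function name `fir_filter` is hypothetical.

```python
def fir_filter(coeffs, samples):
    """Compute y[n] = sum over i of c_i * x[n - i] for an N-tap FIR filter.

    Assumption: x[k] is treated as 0 for k < 0 (empty delay line at start).
    """
    n_taps = len(coeffs)
    output = []
    for n in range(len(samples)):
        acc = 0
        for i in range(n_taps):
            if n - i >= 0:
                # One "tap": a coefficient/data product added to the sum.
                acc += coeffs[i] * samples[n - i]
        output.append(acc)
    return output


# 2-tap example with coefficients c0 = 1 and c1 = 2 on a constant input.
y = fir_filter([1, 2], [1, 1, 1])
```

Each iteration of the inner loop corresponds to one tap, which is the unit that the following slides map onto a DSP block and its data and coefficient RAMs.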
FIR Designs – Use Case 1
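• 2-tap FIR filter involving small block RAMs
[Figure: DSP0 (Tap 0) and DSP1 (Tap 1) chained through PCOUT and PCIN between the data input and the data output; the A and B inputs of each tap are fed by a data RAM (RAMD0, RAMD1) and a coefficient RAM (RAMC0, RAMC1); each RAM is an 18 Kb block RAM, and the 18 Kb block RAMs are grouped into 36 Kb block RAM tiles]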
Packing for Use Case 1
• Packing both 18 Kb block RAMs into one block RAM tile permits a natural alignment between the DSP blocks and the block RAMs
[Figure: floorplan with a column of block RAM tiles next to a column of DSP48E blocks in Virtex-5 DSP tiles; each block RAM tile operates as two independent 18 Kb block RAMs. Annotation: High Performance!]
FIR Designs – Use Case 2
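• 2-tap FIR filter involving larger block RAMs
[Figure: DSP0 (Tap 0) and DSP1 (Tap 1) chained through PCOUT and PCIN; the A and B inputs of each tap are fed by a data RAM (RAMD0, RAMD1) and a coefficient RAM (RAMC0, RAMC1); the RAMs are labeled as 18 Kb and 36 Kb block RAMs]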
Packing for Use Case 2
• Two block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP blocks and the block RAMs
[Figure: floorplan with a column of DSP48E blocks in Virtex-5 DSP tiles placed between two columns of block RAM tiles]
Block RAM Chains
• Use case: the data input and output pins of 18 Kb block RAMs are connected together (e.g. a FIFO)
[Figure: two 18 Kb block RAMs, RAM0 and RAM1, chained between "in" and "out" through their data ports (dia, dib, doa, dob) and address ports (addra, addrb)]
• Algorithm: look for such chains and pack them together into a single block RAM tile
• Special case: 18 Kb block RAMs separated by registers
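The chain-packing idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the netlist is reduced to a hypothetical dictionary `drives` that maps each 18 Kb RAM to the RAM fed by its data output, consecutive RAMs in a chain are paired two per tile, and the register special case is not handled.

```python
def pack_chains(drives):
    """Pair consecutive 18 Kb RAMs of each chain into one block RAM tile.

    drives: dict mapping a RAM name to the RAM whose data input it feeds.
    Returns a list of (ram, ram) pairs, one pair per 36 Kb tile.
    """
    driven = set(drives.values())
    rams = set(drives) | driven
    # A chain starts at a RAM that is not fed by any other RAM.
    heads = sorted(ram for ram in rams if ram not in driven)

    tiles = []
    seen = set()
    for head in heads:
        chain = []
        node = head
        while node is not None and node not in seen:
            seen.add(node)
            chain.append(node)
            node = drives.get(node)
        # Pair neighbours; an odd RAM at the end of a chain stays unpaired.
        for k in range(0, len(chain) - 1, 2):
            tiles.append((chain[k], chain[k + 1]))
    return tiles


fifo_tiles = pack_chains({"RAM0": "RAM1"})
```

A RAM left unpaired at the end of an odd-length chain would simply be handed back to the regular placer in this sketch.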
Block RAM/DSP Packing Results
Circuit     Perf. RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1   500                       400                    25%
Circuit 2   450                       365                    23%
Circuit 3   500                       470                    6%
Circuit 4   425                       435                    -2%
Circuit 5   215                       200                    8%
Geomean     400                       359                    11%
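The geomean row can be reproduced from the five per-circuit frequencies. The sketch below assumes the reported improvement is the ratio of the two geometric means, which the slide does not state explicitly; the helper name `geomean` is hypothetical.

```python
import math


def geomean(values):
    """Geometric mean computed through the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Frequencies in MHz for Circuits 1 to 5, taken from the results table.
with_packing = [500, 450, 500, 425, 215]
baseline = [400, 365, 470, 435, 200]

packing_geomean = geomean(with_packing)
baseline_geomean = geomean(baseline)
improvement_pct = (packing_geomean / baseline_geomean - 1.0) * 100.0
```

Under that assumption the two geomeans round to the tabulated 400 MHz and 359 MHz, and the improvement falls between 11% and 12%.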
Summary
• Described two architecture-specific packing approaches for a 65 nm commercial FPGA: Xilinx Virtex-5
– Dual-output LUT packing in placement:
• Achieves 10.2% SLICE reduction and 5.5% LUT reduction
– Packing for DSPs and block RAMs:
• Achieves 11% performance improvement
Questions