Fracturable LUTs

Simulation of Fracturable LUTs
Tim Pifer
Presentation Overview
• Altera ALM from Stratix II
• Stratix V architecture
• Current VPR method for fracturable LUTs
– WireMap for technology mapping
– AAPack for packing
Altera Adaptive Logic Module
• Traditional 4LUTs provide the best area-delay product
• Larger LUTs
– Shorter critical path
– Absorb more logic
– Larger LUT mask
– More input muxing
• Reduce critical path depth by 20%, improving area
Improving FPGA Performance and Area Using an Adaptive Logic Module
Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan
Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy
Lee, Henry Kim and Rahul Saini
Motivation from other architectures
• BLE5 - 15% fewer LUTs, 25% shorter unit delay
• BLE6 - 22% fewer LUTs, 36% shorter unit delay
• BLE7 - 28% fewer LUTs, 46% shorter unit delay
• K=6 25% 6LUT
• Example design
– K=4 100 LUTs
– K=6 78 LUTs : 23 6LUT, 32 5LUT, 17 4LUT, 9 3LUT, 13 2LUT
• Stratix, VirtexII : 30:1 input mux
• Relative contribution of routing area and interconnect delay increases with each generation of fabrication
Simple Example: 6LUT from 4BLEs
• Larger area
• 19 inputs, 4 registers
• For a 6-LUT, the 4LUTs have identical inputs but separate input muxes
• 3 of 4 registers and outputs wasted
Improved Example: 6,2 Fracturable LE
• 8 inputs, 2 outputs, 2 registers
• 1 6LUT
• 2 5LUTs with input sharing
• 2 independent 4LUTs
• Comparable in area with two BLE4s
• Functionally closer to two BLE5 logic elements
Final Version
• Composed of 3LUTs
• Added d2 muxed output
• c1 or GND, c2 or VCC muxed
• Remove mux from d1
• Swap muxes controlled by R and T
• Two 6LUTs share 4 inputs, identical LUT-mask
• 4:1 muxes, common data, different select lines
• Up to 12% pairs of 6-LUTs
• R=0 T=1 S=1 implements 2 muxed 5LUTs with 7 inputs
– F1 = fn(a1,a2,b1,b2,d1)
– F2 = fn(a1,a2,b2,c2,d1)
– Out = mux(F1,F2,c1)
• Roughly area-neutral with BLE4 and 36% decrease in logic depth
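The 7-input mode above can be sketched in Python as two 5-input LUT lookups followed by a 2:1 mux. This is only an illustrative model: the function names, the mask bit ordering, and the mux polarity (c1 = 0 selects F1) are assumptions, not details taken from the slides.

```python
def lut(mask, *inputs):
    """Read one bit from a LUT mask; inputs[0] is the least significant address bit."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (mask >> index) & 1


def mask_from(fn, n):
    """Build the 2**n-bit mask of an n-input Boolean function (demo helper)."""
    mask = 0
    for index in range(2 ** n):
        bits = [(index >> i) & 1 for i in range(n)]
        if fn(*bits):
            mask |= 1 << index
    return mask


def alm_7input(mask1, mask2, a1, a2, b1, b2, c1, c2, d1):
    """Out = mux(F1, F2, c1) with the input assignment listed on the slide."""
    f1 = lut(mask1, a1, a2, b1, b2, d1)
    f2 = lut(mask2, a1, a2, b2, c2, d1)
    # Assumed polarity: c1 = 0 selects F1, c1 = 1 selects F2.
    return f2 if c1 else f1


# Demo masks (hypothetical): F1 is 5-input parity, F2 is 5-input AND.
PARITY5 = mask_from(lambda *b: sum(b) % 2, 5)
AND5 = mask_from(lambda *b: all(b), 5)
```

F1 and F2 share a1, a2, b2 and d1, so only seven distinct signals enter the block even though each half is a full 5-input function.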
How do we set RSTU for a 6LUT?
8:1 mux implementation
• 8:1 mux in 2 ALMs (4 ALUTs) using 7 input functions
• second ALM computes output
– F1=fn(s0,s1,d3,y0,y1)
– F2=fn(s0,s1,d7,y0,y1)
– mux controlled by s2
• 5 BLE4 vs. 2 ALMs, saves one BLE4
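A small Python check of this decomposition follows. The second ALM matches the slide (F1, F2, and a mux on s2). The first ALM is not spelled out in the slides, so the version here is an assumption: y0 and y1 are taken to be 5-input partial muxes over d0..d2 and d4..d6. All function names are hypothetical.

```python
def mux8_reference(d, s0, s1, s2):
    """Plain 8:1 mux: d is a list of 8 data bits."""
    return d[s0 | (s1 << 1) | (s2 << 2)]


def first_alm(d, s0, s1):
    """Assumed first ALM: two 5-input functions sharing s0 and s1."""
    sel = s0 | (s1 << 1)
    # When sel == 3 the value is a don't-care; 0 is chosen arbitrarily.
    y0 = d[sel] if sel < 3 else 0        # fn(s0, s1, d0, d1, d2)
    y1 = d[4 + sel] if sel < 3 else 0    # fn(s0, s1, d4, d5, d6)
    return y0, y1


def second_alm(s0, s1, s2, d3, d7, y0, y1):
    """Second ALM in 7-input mode: mux(F1, F2, s2)."""
    sel_is_3 = s0 & s1
    f1 = d3 if sel_is_3 else y0          # F1 = fn(s0, s1, d3, y0, y1)
    f2 = d7 if sel_is_3 else y1          # F2 = fn(s0, s1, d7, y0, y1)
    return f2 if s2 else f1


def mux8_two_alms(d, s0, s1, s2):
    y0, y1 = first_alm(d, s0, s1)
    return second_alm(s0, s1, s2, d[3], d[7], y0, y1)
```

Under this assumption the first ALM uses eight distinct inputs (s0, s1, d0..d2, d4..d6) and the second uses seven, so both fit the input counts quoted for the ALM.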
Stratix V
• ALM can become two 4LUTs
• Eight inputs for both ALUTs
• Backward compatible with 4LUT architectures
Logic Array Blocks and Adaptive Logic Modules in Stratix V Devices
Stratix V
Normal Modes
Normal LUT mode
• Single 6LUT mode, other inputs used for registers
Extended LUT mode
• 7-input function
• 2-to-1 multiplexer with two 5LUTs sharing 4 inputs
• Typical source: if-else statements
Why 6LUTS: DES Example
• DES: 8 S-boxes (substitution tables)
• Each S-box has 6 inputs, 4 outputs
• Each output:
– 1 6LUT, or
– 6 4LUTs
• 35-45% less area
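One way to see why a single 6-input output costs several 4LUTs is Shannon expansion: the 64-bit 6LUT mask splits into four 16-bit cofactor masks (one 4LUT each), and the remaining two inputs select among them. The sketch below only illustrates that split. It does not model how the final 4:1 selection is itself mapped into 4LUTs, and the function names are hypothetical.

```python
def eval_6lut(mask6, x):
    """Evaluate a 6-input LUT; x is a list of 6 bits, x[0] least significant."""
    index = sum(bit << i for i, bit in enumerate(x))
    return (mask6 >> index) & 1


def cofactor_masks(mask6):
    """Four 16-bit 4LUT masks, one for each value of (x[4], x[5])."""
    return [(mask6 >> (16 * k)) & 0xFFFF for k in range(4)]


def eval_decomposed(mask6, x):
    """Evaluate the same function as four 4LUTs followed by a 4:1 select."""
    low = sum(bit << i for i, bit in enumerate(x[:4]))
    outs = [(m >> low) & 1 for m in cofactor_masks(mask6)]
    return outs[x[4] | (x[5] << 1)]
```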
How would we alter technology
mapping to best support FLUTs?
Technology Mapping
• 1 4LUT and 2 6LUTs require 3 ALMs
• Could use 4 5LUTs requiring 2 ALMs with the same logic depth
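The ALM counts above can be reproduced with a deliberately optimistic model: a 6LUT takes a whole ALM, and any two LUTs with five or fewer inputs share one. Real pairing also depends on how many inputs the two LUTs share, which this sketch ignores. The function name is hypothetical.

```python
def alms_needed(lut_sizes):
    """Optimistic ALM count for a list of LUT input counts (2..6)."""
    if any(k < 1 or k > 6 for k in lut_sizes):
        raise ValueError("LUT sizes must be between 1 and 6")
    six = sum(1 for k in lut_sizes if k == 6)
    small = sum(1 for k in lut_sizes if k <= 5)
    # Each 6LUT occupies one ALM; smaller LUTs are assumed to pair up freely.
    return six + (small + 1) // 2


mapping_a = [4, 6, 6]       # 1 4LUT and 2 6LUTs
mapping_b = [5, 5, 5, 5]    # 4 5LUTs
```

With this model `alms_needed(mapping_a)` is 3 and `alms_needed(mapping_b)` is 2, matching the slide.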
Balancing Technology Mapping
• Must maintain optimal critical path depth with a more packable LUT distribution
• Avoid 6-LUTs when they do not help delay
• 8:1 muxes identified separately and mapped to 7-input functions
• 7% of ALMs are 7-input functions
Results: Performance
• 80 designs tested
• 130nm process
• Minimum chip size used
• SPICE models for delay
Results: Area
Stratix vs. Stratix II
Conclusions
• Benefits of 6LUTs without underutilization
• Larger LUT costs:
– LUT-mask size
– Input and output muxing
– FFs
• 6-LUT is fracturable into 5LUTs, area comparable to 2 BLE4s
• 7-input functions and 6-input pairs
• Technology mapping support is needed for best results
• 6,2 ALM vs. BLE4:
– 15% better performance
– 12% smaller area on average
How do we need to alter VPR to
support FLUTs?
AAPack and WireMap
• ABC with WireMap: technology mapping to primitives
• AAPack: capable of packing complex logic blocks based on logic primitives
WireMap
• Reduces the percentage of 6LUTs
• Does not increase:
– Logic depth
– Total LUT count
WireMap: FPGA Technology Mapping for Improved
Routability
Stephen Jang, Billy Chan, Kevin Chung, Alan Mishchenko
AAPack Overview
• Current tools can’t support the complexity of logic blocks
• New logic block description language:
– Depict complex interconnects
– Hierarchy
– Modes of operation
• Can pack complex blocks
• Area driven
• Area is compared to the theoretical minimum
• Verilog input for large benchmarks
Architecture Description and Packing for Logic Blocks with Hierarchy, Modes
and Complex Interconnect
Jason Luu, Jason Anderson, and Jonathan Rose
Example: Virtex-6 Logic Block
• Tools don’t support Stratix IV or Virtex-6
• Virtex-6:
– Complex soft logic blocks
– Hard memories
– Multipliers
What AAPack does
• Can describe:
– complex logic blocks with arbitrary internal
routing structures
– Variable memory configurations: 4Kx8, or 8Kx4, or
16Kx2
• area-driven packing
• inputs:
– user design
– architectural description
Complex Block Description Language
• Expressive: The language should be capable of
describing a wide range of complex blocks.
• Simple: The language constructs should match
closely with an FPGA architect’s existing
knowledge and intuition.
• Concise: The language should permit complex
blocks to be described as concisely as possible.
Physical blocks
• Specified in XML
• Hierarchy
– Other blocks
– Existing primitives
• Inputs, outputs, and clocks with pin numbers
Primitives
• Common primitives are handled in the language
• LUT inputs can be reordered; memory address pins cannot
Intra-Block Interconnect
• Complete: crossbar switch, internally programmable
• Direct: direct wire connection, no programmability
• Mux: multiplexed connection (single-bit or bus), programmable
Modes of Operation
• Mutually exclusive
functionality
• Represent FPGA
structures being used
in different ways
Packing Algorithm
• Input: technology-mapped netlist, XML architecture
• Output: packed complex blocks
• Greedy algorithm similar to other packing methods
• Repeat until all blocks are packed
• Seed block s selected and packed
• New complex block B for s
• Pack additional blocks into B
– Choose a compatible block c
– Pack c into B if valid
• Add B to packed list
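The greedy loop above can be sketched in Python as follows. This is a simplified illustration, not AAPack itself: blocks are plain names with sets of nets, the seed is the unpacked block with the most nets, candidates are ranked by shared-net count only, and legality is reduced to a capacity limit plus an optional caller-supplied check. `greedy_pack` and `is_valid` are hypothetical names.

```python
def greedy_pack(blocks, capacity, is_valid=lambda cluster, candidate: True):
    """Pack netlist blocks into clusters.

    blocks: dict mapping block name -> set of net names it touches.
    capacity: maximum number of netlist blocks per complex block.
    is_valid: stand-in for the location and routability checks.
    """
    unpacked = dict(blocks)
    packed = []
    while unpacked:                                   # until all blocks are packed
        # Seed: unpacked block with the most nets (ties broken by name).
        seed = max(sorted(unpacked), key=lambda b: len(unpacked[b]))
        cluster = [seed]
        nets = set(unpacked.pop(seed))
        while len(cluster) < capacity:
            # Compatible candidates share at least one net with the cluster.
            candidates = sorted(
                (b for b in unpacked if unpacked[b] & nets),
                key=lambda b: (-len(unpacked[b] & nets), b),
            )
            chosen = next((c for c in candidates if is_valid(cluster, c)), None)
            if chosen is None:
                break
            cluster.append(chosen)                    # pack c into B
            nets |= unpacked.pop(chosen)
        packed.append(cluster)                        # add B to the packed list
    return packed


example = {"a": {"n1", "n2", "n3"}, "b": {"n1", "n2"}, "c": {"n3"}, "d": {"n9"}}
```

With `capacity=2`, block a seeds the first cluster and pulls in b, which shares two nets with it; c and d end up in clusters of their own.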
Selecting Netlist and Complex Blocks
• Choose the block with the most nets attached
• Candidates are selected based on affinity in equation 1
• Affinity = shared nets and connections divided by the number of pins the new block would add
• Connections is a measure of how likely the new block will need external connections
• Alpha is set to 0.9
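Equation 1 itself is not reproduced in the slide text, so the exact form below is an assumption: shared nets and connections are combined with an alpha weighting and then divided by the number of pins the candidate would add. The function names and the choice to weight shared nets by alpha are illustrative only.

```python
def affinity(shared_nets, connections, new_pins, alpha=0.9):
    """Assumed form: (alpha * shared + (1 - alpha) * connections) / new_pins."""
    if new_pins < 1:
        raise ValueError("new_pins must be at least 1 in this sketch")
    return (alpha * shared_nets + (1 - alpha) * connections) / new_pins


def best_candidate(candidates, alpha=0.9):
    """candidates: dict name -> (shared_nets, connections, new_pins)."""
    return max(sorted(candidates), key=lambda c: affinity(*candidates[c], alpha=alpha))


demo = {"lut_x": (2, 1, 4), "lut_y": (1, 3, 1)}
```

Here `lut_y` wins despite sharing fewer nets, because it would add only one pin to the cluster.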
Legality: Location
• Attempting to pack:
– Chooses a location
– Verifies routing
• Traverses the complex block as a tree
• Ordered smallest to largest, right to left
• Traversed right to left to ensure smallest resource consumption
• Attempts to pack the other nodes in the sub-tree: find a flip flop for a LUT
• 30 packs on the sub-tree
Legality: Routability
• Initially, check if packing would exceed the external pin count
• Then, generate routing graph for complex block
• Assume any output can connect to any input of a complex block (switchbox architecture)
• Apply PathFinder
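The first, cheap check above can be sketched as a count of nets that cross the cluster boundary. This is a simplification: it does not separate input pins from output pins and does not model the routing graph or PathFinder step. The function names and the `net_to_blocks` structure are hypothetical.

```python
def external_pin_count(cluster, net_to_blocks):
    """Count nets touching the cluster and at least one block outside it."""
    inside = set(cluster)
    count = 0
    for blocks in net_to_blocks.values():
        blocks = set(blocks)
        if blocks & inside and blocks - inside:
            count += 1
    return count


def passes_pin_check(cluster, candidate, net_to_blocks, max_pins):
    """Would adding the candidate keep the cluster within its external pin budget?"""
    return external_pin_count(list(cluster) + [candidate], net_to_blocks) <= max_pins


nets = {"n1": ["a", "b"], "n2": ["a", "c"], "n3": ["b", "d"]}
```

Adding b to a cluster containing a absorbs n1 but leaves n2 and n3 crossing the boundary, so the cluster needs two external pins.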
Memory
• Primitives are technology mapped with a single-bit width
• A 256x8 memory is mapped as eight 256x1 memories
• Primitives are mapped to the same component if their bus signals are identical
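A sketch of the slicing and regrouping idea: a wide memory is split into one-bit slices, and slices are grouped back together when their bus signals match. Treating "bus signals" as the address bus plus a control net is an assumption, and every name here is hypothetical.

```python
from collections import defaultdict


def split_memory(name, depth, width, addr_bus, ctrl):
    """Split a depth x width memory into `width` single-bit slices."""
    return [
        {"name": f"{name}[{bit}]", "depth": depth, "addr": tuple(addr_bus), "ctrl": ctrl}
        for bit in range(width)
    ]


def group_slices(slices):
    """Group slices that share depth, address bus, and control signal."""
    groups = defaultdict(list)
    for s in slices:
        groups[(s["depth"], s["addr"], s["ctrl"])].append(s["name"])
    return dict(groups)


addr_a = [f"a{i}" for i in range(8)]   # 8 address bits for 256 words
addr_b = [f"b{i}" for i in range(8)]
slices = split_memory("ram0", 256, 8, addr_a, "we0") + split_memory("ram1", 256, 2, addr_b, "we1")
groups = group_slices(slices)
```

The eight slices of `ram0` land in one group and the two slices of `ram1` in another, since their address buses differ.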
Limitations
• No support for timing in this implementation
• A primitive can map to only one complex logic block
– Flip flops can only be used in a LUT complex block; they cannot also be present in multiplier complex blocks
What are some faults of this packing
method?
Experiments
• Verilog benchmarks
• Soft processors
• Image processors
Fracturable LUTs
• CLBs:
– Fully connected BLEs
– FI x N, no pin sharing
– 8 BLEs
• BLEs:
– 1 FLUT
– 2 flip flops
– 2 outputs
• FLUTs:
– 2 modes
• 6LUT
• Dual 5LUTs
– Variable number of inputs
– Dual mode input sharing depends on number of inputs
FLUT Evaluation
• Compare achieved area with the lower bound
• Lower bound: number of complex blocks needed to contain the primitives without routing considerations
• Efficiency: ratio between the achieved number of logic blocks and this lower bound
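A minimal sketch of the metric, under two assumptions that the slides do not state: the lower bound is taken as the primitive count divided by a fixed per-block capacity (rounded up), and efficiency is expressed as lower bound over achieved, so 1.0 means the packer hit the bound. Function names are hypothetical.

```python
import math


def lower_bound_blocks(num_primitives, primitives_per_block):
    """Blocks needed if only capacity mattered (routing ignored)."""
    return math.ceil(num_primitives / primitives_per_block)


def efficiency(achieved_blocks, lower_bound):
    """Assumed direction: 1.0 is ideal, smaller values mean wasted blocks."""
    return lower_bound / achieved_blocks


bound = lower_bound_blocks(100, 16)   # e.g. 100 LUTs, 16 per block
```

For example, packing those 100 primitives into 8 blocks against a bound of 7 gives an efficiency of 0.875.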
Efficiency Results
• Number of inputs FI varied from 5 to 10
• Geometric average across 5 benchmarks
• FI = 5 means all inputs are shared; FI = 10 means no inputs are shared
• FI = 6 or 7 achieves tolerable efficiency
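The link between FI and input sharing can be made concrete with a small check: two 5LUTs fit in one FLUT in dual mode only if their combined distinct inputs do not exceed FI. This is a sketch of that counting argument only, and the function names are hypothetical.

```python
def min_shared_inputs(fi):
    """Inputs two full 5LUTs must share to fit in a FLUT with fi input pins."""
    return max(0, 10 - fi)


def fits_dual_5lut(inputs_a, inputs_b, fi):
    """True if both LUTs have at most 5 inputs and their union fits in fi pins."""
    a, b = set(inputs_a), set(inputs_b)
    if len(a) > 5 or len(b) > 5:
        return False
    return len(a | b) <= fi


lut_a = ["a", "b", "c", "d", "e"]
lut_b = ["a", "b", "c", "f", "g"]   # shares three inputs with lut_a
```

These two LUTs need seven distinct pins, so they pair up at FI = 7 but not at FI = 6.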
Logic blocks and channel width vs. number of inputs
• Number of blocks decreases up to FI = 7
• Channel width vs. number of inputs
– First increases, since more routing goes to each block
– Then decreases after 7: full efficiency, so routing is easier
Memory
• Varied number of bits and maximum width
• Best utilization: smallest size, maximum width
CLB consumption with memory size
• Smaller memories: more logic due to muxes
• Best results: multiple memory sizes
Conclusions
• New language can describe complex
architectures using:
– Hierarchy
– Modes
– Arbitrary interconnects
• Packing algorithm for this architecture
• Verified on large benchmarks
• Needs timing driven packing
How can we get additional
improvement from technology
mapping?
Academic FLUTs soft logic
• 4 architectures:
– K = 6
– M = 5, 6, 7, 8
• M5: dual-output 6LUT of a Xilinx Virtex-5
• M8: Stratix II ALM
Exploring FPGA Technology Mapping for Fracturable LUT Minimization
David Dickin, Lesley Shannon
BLE
• BLE:
– 1 FLUT
– 2 Registers
– 8 inputs
– 4 outputs
LUT Balancing Experiments
• WireMap - no LUT balancing
• WireMap - with LUT balancing
• Increase the cost of LUT5, LUT6 from 1.0 to 2.5 in 0.1 increments
• Smaller LUT weights unchanged
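The weight sweep can be sketched as below. The cost function is an assumed stand-in for what the mapper minimizes (a weighted LUT count), not WireMap's actual objective, and the function names are hypothetical.

```python
def sweep_weights(start=1.0, stop=2.5, step=0.1):
    """Weights from start to stop inclusive, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n + 1)]


def mapping_cost(lut_counts, w5=1.0, w6=1.0):
    """Weighted LUT count; LUTs with fewer than 5 inputs keep weight 1.0."""
    weights = {5: w5, 6: w6}
    return sum(count * weights.get(k, 1.0) for k, count in lut_counts.items())


weights = sweep_weights()
counts = {4: 10, 5: 4, 6: 2}   # hypothetical mapping: LUT size -> count
```

Raising `w5` and `w6` makes a mapping with many large LUTs look more expensive, which pushes the mapper toward smaller, more packable LUTs.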
Varying 6LUT weight
Varying 5LUT and 6LUT weight
Different Architectures
Clock Frequency
FLUT Reduction by Architecture