Application-Specific Memory Interleaving Enables High Performance in FPGA-based Grid Computations
Grid computation: candidate for acceleration
Many applications in molecular dynamics, physics, molecule docking, Perlin noise, image processing …
Computation characteristics being addressed:
• Cluster of grid points needed at each step
• Grid cells accessed in irregular order
  Invalidates typical schemes for reusing data
• Working set fits into FPGA’s on-chip RAM
FPGAs: Technological opportunity
Traditional memory interleaving for broad parallelism – in use since the 1960s
Generic: Designed to avoid application specifics
Fixed bus: All applications use same memory interface
Expensive: High hardware costs, accessible only for major processor designs
FPGAs for memory interleaving – ideal technological match
Customizable: Can adapt to arbitrary application characteristics
Not just permitted, customization is inherent and compulsory
Configurable: Unique interleaving structure for each application
Multiple different structures for different parts of one application
[Figure: Bilinear interpolation for computing off-grid points]
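Bilinear interpolation is a small instance of the cluster access pattern: every off-grid sample needs four neighbouring grid points in the same step. A minimal sketch follows. The function name `bilinear` and the row-major `grid[iy][ix]` layout are assumptions made for illustration, not taken from the poster.

```python
import math

def bilinear(grid, x, y):
    """Interpolate grid (a list of rows, indexed grid[iy][ix]) at fractional (x, y).

    Illustrative sketch only: assumes 0 <= x < width - 1 and 0 <= y < height - 1.
    """
    ix, iy = math.floor(x), math.floor(y)
    fx, fy = x - ix, y - iy
    # The access cluster: four neighbouring grid points, all needed at once.
    p00 = grid[iy][ix]
    p10 = grid[iy][ix + 1]
    p01 = grid[iy + 1][ix]
    p11 = grid[iy + 1][ix + 1]
    # Blend along X on both rows, then blend the two rows along Y.
    top = p00 * (1 - fx) + p10 * fx
    bottom = p01 * (1 - fx) + p11 * fx
    return top * (1 - fy) + bottom * fy
```

In software the four reads are serial; the point of interleaving is to let hardware issue all four in one cycle.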
Implementing for FPGA computation
Free (almost):
Allows reconsideration of the whole algorithm
Optimal FPGA algorithms are commonly very different from sequential implementations
Developer has access to algorithm’s logical indexing scheme
Extra design information in 2-, 3-, ... dimensional indexing, before flattening into RAM addresses
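The value of keeping the logical index can be seen by comparing bank assignments for a 2x2 access cluster. This is a sketch under assumed parameters: the row width of 64, the 4 banks, and all function names are hypothetical and chosen only for illustration.

```python
def linear_banks(x, y, width, n_banks):
    """Banks hit by a 2x2 cluster when bank = flattened address mod n_banks."""
    return [((y + dy) * width + (x + dx)) % n_banks
            for dy in (0, 1) for dx in (0, 1)]

def grid_banks(x, y):
    """Banks hit by the same cluster when bank = (X mod 2, Y mod 2)."""
    return [((x + dx) % 2, (y + dy) % 2) for dy in (0, 1) for dx in (0, 1)]

def conflict_free(banks):
    """True when every cluster point lands in a different bank."""
    return len(set(banks)) == len(banks)
```

With a row width that is a multiple of the bank count, the flattened scheme sends vertically adjacent points to the same bank, so the cluster needs two cycles. Indexing banks by the 2D coordinates avoids the conflict for every cluster position.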
FPGAs support massive, fine-grained parallelism in computation pipeline
Often throttled by serial access to RAM operands
Goal: Fetch enough operands to fill the width of the computation array
10s to 100s of independently addressable RAM busses
On-chip bus widths 100s to 1000s of bits
Cheap, fast logic for address generation & de-interleaving networks
FPGA-based computation is an emerging field
Does not have software’s huge base of widely applicable techniques
Needs to develop a “cookbook” of reusable computation structures
Implementation technique
1. Define application’s access cluster
   Convert to rectangular array
2. Round up to power of 2 bounding box
   RAM banks indexed by {X, Y} mod 4
3. Address generation: Map access cluster to grid
   Handle wraparound: {X, Y} / 4 vs {X, Y} / 4 + 1
4. De-interleaving: Map RAM banks to outputs
The general case, not just limited to 1D or 2D arrays
[Figure: access cluster and RAM bank array; cells labeled 1a–1d, 2a–2d, 3a–3d; bank indices 0–3 on each axis]
[Figure: block diagram; labels X, Y, MSBs, address generation with optional +1, RAM array, LSBs, de-interleave]
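The four steps can be modeled in software. This is a behavioural sketch, not the generated VHDL. It assumes a 2D grid whose dimensions are multiples of 4, a cluster equal to the full 4x4 bounding box, and row-major addressing inside each bank. The class name `InterleavedGrid` and its methods are hypothetical.

```python
N = 4  # banks per axis (power of 2), giving an N x N RAM array

class InterleavedGrid:
    def __init__(self, width, height):
        # Assumption: width and height are multiples of N.
        self.cols = width // N
        self.rows = height // N
        # banks[by][bx] is one independently addressable RAM.
        self.banks = [[[0] * (self.cols * self.rows) for _ in range(N)]
                      for _ in range(N)]

    def write(self, x, y, value):
        bx, by = x % N, y % N      # LSBs select the bank
        ax, ay = x // N, y // N    # MSBs form the address inside the bank
        self.banks[by][bx][ay * self.cols + ax] = value

    def read_cluster(self, x, y):
        """Return the N x N block with corner (x, y), using one read per bank.

        Assumes the block lies inside the grid (x <= width - N, y <= height - N).
        """
        # Step 3, address generation: banks below the corner's LSBs hold
        # points that wrapped around, so they read at MSBs + 1.
        fetched = [[None] * N for _ in range(N)]
        for by in range(N):
            ay = y // N + (1 if by < y % N else 0)
            for bx in range(N):
                ax = x // N + (1 if bx < x % N else 0)
                fetched[by][bx] = self.banks[by][bx][ay * self.cols + ax]
        # Step 4, de-interleaving: rotate bank outputs so that
        # result[dy][dx] is grid point (x + dx, y + dy).
        return [[fetched[(y + dy) % N][(x + dx) % N] for dx in range(N)]
                for dy in range(N)]
```

In hardware all 16 bank reads happen in the same cycle. An application whose cluster is smaller than the bounding box would keep only the outputs it needs.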
Variations and extensions
Write port design choices
Can use dual-ported RAM for non-interfering, concurrent read & write
Write single words or clusters; write cluster need not be same shape as read cluster
Take advantage of dual-ported RAMs, when available
Allocate less hardware to small grids
Can use non-power-of-2 memory arrays
LSBs become X mod NX: efficient implementations for modest NX
MSBs become X div NX: efficient implementations using block multipliers
Allows wide range of design tradeoffs:
Logic & multipliers vs. RAMs
Latency vs. hardware
Dimensions: 1, 2, 3, …
Adapts easily to dimensionality
Adapts easily to cluster size & shape
Optimize de-interleaving multiplexers
4x4x4 RAM array requires 64:1 output multiplexers
Implement efficiently as three layers of 4:1 multiplexers
[Figure labels: X0:1, Y0:1, Z0:1, Output]
Automation
Java program – initial version available
Source code and documentation: see http://www.bu.edu/caadlab/publications
Sample input for hex grid example above
name=HexGrid        Name of VHDL component
axis=horiz          Symbol names for axis indices
axis=vert
databits=16         Width of individual word
output=A2,1,0       Define the access cluster
output=A3,2,0
output=B1,0,1
output=B2,1,1
output=B3,2,1
output=C2,1,2
output=C3,2,2
testSize=150,75     Grid size for test bench
VHDL output
HexGrid.vhdl                Synthesizable entity definition
HexGrid_def.vhdl            Declaration package
HexGrid_test_driver.vhdl    Test bench, confirms implementation
1. Barnes, George H., Richard M. Brown, Maso Kato, David J. Kuck, Daniel L. Slotnick, and Richard A. Stokes. The Illiac IV Computer. IEEE Transactions on Computers 17(8), August 1968
2. Böhm, A.P.W., B. Draper, W. Najjar, J. Hammes, R. Rinker, M. Chawathe, and C. Ross. One-step Compilation of Image Processing Applications to FPGAs. Proc. FCCM, 2001
3. M. B. Gokhale and J. M. Stone. Automatic Allocation of Arrays to Memories in FPGA Processors With Multiple Memory Banks. Proc. FCCM, 1999
Tom VanCourt, Martin Herbordt
{tvancour, herbordt}@bu.edu
Boston University
www.bu.edu/caadlab
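For the non-power-of-2 variation, X div NX can be computed with a single multiplication by a precomputed reciprocal followed by a shift. The poster only says "block multipliers"; the reciprocal scheme below is an assumed, standard way to use one, and the function names are hypothetical.

```python
def reciprocal_constants(n, bits):
    """Return (multiplier, shift) so that x // n == (x * multiplier) >> shift
    for every 0 <= x < 2**bits. Assumes n >= 1."""
    shift = bits + (n - 1).bit_length()    # bits + ceil(log2(n))
    multiplier = -(-(1 << shift) // n)     # ceil(2**shift / n)
    return multiplier, shift

def div_mod(x, n, multiplier, shift):
    """Split index x into (address, bank) for n banks along one axis."""
    q = (x * multiplier) >> shift          # X div NX: address inside the bank
    r = x - q * n                          # X mod NX: bank select
    return q, r
```

The constants depend only on NX and the index width, so they can be fixed at design time. The multiply maps onto one block multiplier, and the remainder needs one small multiply-subtract.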