TPUTCACHE: HIGH-FREQUENCY,
MULTI-WAY CACHE FOR HIGH-THROUGHPUT
FPGA APPLICATIONS
Aaron Severance
University of British Columbia
Advised by Guy Lemieux
1
Our Problem

•We use overlays for data processing
  •Partially/fully fixed processing elements
  •Virtual CGRAs, soft vector processors
•Memory:
  •Large register files/scratchpad in the overlay
    •Low latency, local data
  •Trivial case (large DMA): burst to/from DDR
  •Non-trivial case?
2
Scatter/Gather

•Data-dependent store/load

  vscatter adr_ptr, idx_vect, data_vect

  for i in 1..N
    adr_ptr[idx_vect[i]] <= data_vect[i]

•Random narrow (32-bit) accesses
  •Waste bandwidth on DDR interfaces
3
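The scatter/gather semantics above can be sketched in software terms. This is an illustrative Python model only; `vscatter`/`vgather` follow the slide's pseudocode and are not a real API:

```python
def vscatter(adr_ptr, idx_vect, data_vect):
    """Data-dependent store: adr_ptr[idx_vect[i]] <= data_vect[i] for each i."""
    for i, d in zip(idx_vect, data_vect):
        adr_ptr[i] = d

def vgather(adr_ptr, idx_vect):
    """Data-dependent load: element i reads adr_ptr[idx_vect[i]]."""
    return [adr_ptr[i] for i in idx_vect]

mem = [0] * 8
vscatter(mem, [5, 1, 5], [10, 20, 30])  # a later write wins on an index collision
assert mem == [0, 20, 0, 0, 0, 30, 0, 0]
assert vgather(mem, [1, 5]) == [20, 30]
```

Each element touches an unpredictable 32-bit word, which is why these accesses waste bandwidth on a wide DDR burst interface.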
If Data Fits on the FPGA…

•BRAMs with an interconnect network
•General network…
  •Not customized per application
  •Shared: all masters <-> all slaves
•Memory-mapped BRAM
  •Double-pump (2x clk) if possible
  •Banking/LVT/etc. for further ports
4
Example BRAM system
5
But if data doesn’t fit…
(oversimplified)
6
So Let’s Use a Cache

•But a throughput-focused cache
  •Low-latency data held in local memories
  •Amortize latency over multiple accesses
  •Focus on bandwidth
7
Replace on-chip memory or
augment memory controller?

•Data fits on-chip
  •Want BRAM-like speed and bandwidth
  •Low overhead compared to shared BRAM
•Data doesn’t fit on-chip
  •Use ‘leftover’ BRAMs for performance
8
TputCache Design Goals

•Fmax near BRAM Fmax
•Fully pipelined
•Support multiple outstanding misses
•Write coalescing
•Associativity
9
TputCache Architecture

•Replay-based architecture
  •Reinsert misses back into the pipeline
  •Separate line fill/evict logic runs in the background
  •Token FIFO completes requests in order
•No MSHRs for tracking misses
  •Fewer muxes (only a single replay-request mux)
  •6-stage pipeline -> 6 outstanding misses
•Good performance with high hit rate
  •Common case fast
10
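The replay mechanism can be modeled in a few lines. This is a toy Python simulation, not the RTL: the class and method names are invented, and the background fill is modeled as completing instantly between pipeline passes.

```python
from collections import deque

class ReplayCache:
    """Toy model of a replay-based cache: misses are reinserted into the
    pipeline instead of being tracked in MSHRs, and a token FIFO forces
    requests to complete in issue order."""

    def __init__(self, resident_lines, line_bytes=32):
        self.lines = set(resident_lines)   # line numbers currently resident
        self.line_bytes = line_bytes
        self.tokens = deque()              # issue-order completion tokens
        self.pipe = deque()                # (token, addr) requests in flight
        self._next_token = 0

    def issue(self, addr):
        token = self._next_token
        self._next_token += 1
        self.tokens.append(token)
        self.pipe.append((token, addr))
        return token

    def step(self):
        """One pipeline pass: a request retires only if its line is resident
        AND it sits at the head of the token FIFO; otherwise it replays."""
        retired = []
        for _ in range(len(self.pipe)):
            token, addr = self.pipe.popleft()
            line = addr // self.line_bytes
            if line in self.lines and self.tokens[0] == token:
                self.tokens.popleft()
                retired.append(token)
            else:
                self.lines.add(line)             # background fill completes
                self.pipe.append((token, addr))  # replay the request
        return retired

# Line 0 is resident; address 100 (line 3) misses and replays once.
c = ReplayCache(resident_lines={0})
for addr in (0, 100, 4):
    c.issue(addr)
```

Because a missed request simply circulates until its line is resident and its token reaches the head of the FIFO, no MSHRs are needed, and the number of in-flight misses is bounded by the pipeline depth.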
TputCache Architecture
11
Cache Hit
12
Cache Miss
13
Evict/Fill Logic
14
Area & Fmax Results
•Reaches 253MHz, compared to 270MHz BRAM Fmax, on Cyclone IV
•423MHz compared to 490MHz BRAM Fmax on Stratix IV
•Minor degradation with increasing size and associativity
•13% to 35% extra BRAM usage for tags and queues
15
Benchmark Setup

•TputCache
  •128kB, 4-way, 32-byte lines
•MXP soft vector processor
  •16 lanes, 128kB scratchpad memory
•Scatter/Gather memory unit
  •Indexed loads/stores per lane
•Double-pumping port adapters
  •TputCache runs at 2x the frequency of MXP
16
MXP Soft Vector Processor
[Block diagram: MXP soft vector processor — Nios II/f host (I$/D$) on the Avalon fabric; DMA and vector work queues, instruction decode & control, and address generation; a 4-bank vector scratchpad (Bank 0-3) with alignment networks (Align 1/SrcA, Align 2/SrcB, Align 3/DstC), one ALU per bank, and an accumulator; custom vector instructions and custom DMA filtering; the Throughput Cache attaches via the scatter/gather (S/G) unit, carrying scatter/gather addresses plus gather and scatter data, alongside the DDR controller.]
17
Histogram
•Instantiate a number of Virtual Processors (VPs) mapped across lanes
•Each VP histograms part of the image
•Final pass to sum VP partial histograms
18
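The VP partitioning on this slide can be modeled in software. An illustrative Python sketch; the striping of pixels across VPs and the bin layout are assumptions:

```python
def histogram(image, bins, num_vps):
    """Each virtual processor (VP) histograms its share of the image into a
    private partial histogram; a final pass sums the partials."""
    partials = [[0] * bins for _ in range(num_vps)]
    for vp in range(num_vps):
        for pixel in image[vp::num_vps]:  # stripe pixels across VPs
            partials[vp][pixel] += 1      # scatter/gather-style bin update
    # Final reduction pass over the per-VP partial histograms.
    return [sum(p[b] for p in partials) for b in range(bins)]

img = [0, 1, 1, 3, 0, 1]
assert histogram(img, bins=4, num_vps=2) == [2, 3, 0, 1]
```

Private partials avoid read-modify-write collisions between VPs hitting the same bin; only the final sum crosses VPs.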
Hough Transform
•Convert an image to 2D Hough Space (angle, radius)
•Each vector element calculates the radius for a given angle
•Adds pixel value to counter
19
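The per-element computation can be sketched as follows. An illustrative Python model: for a binary edge image each vote is +1 (the slide's "adds pixel value" generalizes this to weighted pixels), and the angle discretization is an assumption:

```python
import math

def hough_lines(points, num_angles, max_radius):
    """Accumulate votes in 2D Hough space (angle, radius). Each element
    computes radius = x*cos(theta) + y*sin(theta) for its angle and
    increments the counter -- a scatter/gather-style accumulation."""
    acc = [[0] * (2 * max_radius) for _ in range(num_angles)]
    for x, y in points:
        for a in range(num_angles):
            theta = math.pi * a / num_angles
            r = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[a][r + max_radius] += 1  # offset: radius may be negative
    return acc

# Three points on the vertical line x = 2 all vote for (theta=0, r=2).
votes = hough_lines([(2, 0), (2, 1), (2, 2)], num_angles=4, max_radius=8)
assert votes[0][2 + 8] == 3
```

The counter increments land at data-dependent (angle, radius) addresses, which is exactly the random narrow access pattern TputCache targets.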
Motion Compensation
•Load block from reference image, interpolate
•Offset by small amount from location in current image
20
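A software model of the block fetch and interpolation. This is a toy Python sketch assuming bilinear interpolation at a fractional offset; function and parameter names are invented:

```python
def mc_block(ref, bx, by, fx, fy, size):
    """Gather a size x size block from reference frame `ref` at integer
    position (bx, by) plus fractional offset (fx, fy) in [0, 1), with
    bilinear interpolation. The four neighbouring-pixel loads per output
    pixel are the random narrow accesses a throughput cache serves."""
    out = []
    for y in range(size):
        row = []
        for x in range(size):
            p00 = ref[by + y][bx + x]
            p10 = ref[by + y][bx + x + 1]
            p01 = ref[by + y + 1][bx + x]
            p11 = ref[by + y + 1][bx + x + 1]
            top = p00 * (1 - fx) + p10 * fx
            bot = p01 * (1 - fx) + p11 * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

ref = [[0, 2], [4, 6]]
assert mc_block(ref, 0, 0, 0.5, 0.5, 1) == [[3.0]]  # mean of the 4 neighbours
```

The block location depends on the motion vector, so consecutive fetches hit scattered regions of the reference frame rather than a single DMA-friendly burst.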
Future Work

•More ports needed for scalability
  •Share the evict/fill BRAM port with a 2nd request
  •Banking (sharing the same evict/fill logic)
  •Multiported BRAM designs
•Write cache
  •Currently allocate-on-write
  •Track per-byte dirty state in the BRAMs’ 9th bit
•Non-blocking behavior
  •Multiple token FIFOs (one per requestor)?
21
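The per-byte dirty tracking idea can be sketched in software. A toy Python model under the slide's assumption that each byte carries a dirty bit (in hardware, the BRAM's 9th bit); class and method names are invented:

```python
class WriteCoalescingLine:
    """Toy model of an allocate-on-write cache line with per-byte dirty
    bits. A partial-line write needs no fill from memory, and eviction
    writes back only the bytes actually written."""

    def __init__(self, size=32):
        self.data = bytearray(size)
        self.dirty = [False] * size  # hardware: the 9th bit per BRAM byte

    def write(self, offset, payload):
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.dirty[offset + i] = True

    def evict(self, backing, base):
        """Write back only dirty bytes, then clear the dirty state."""
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                backing[base + i] = self.data[i]
        self.dirty = [False] * len(self.dirty)

backing = bytearray(b"\xff" * 32)
line = WriteCoalescingLine()
line.write(4, b"\x01\x02")
line.evict(backing, 0)
assert backing[4] == 1 and backing[5] == 2
assert backing[0] == 0xFF  # untouched bytes are never written back
```

This is what makes allocate-on-write safe without a read-for-ownership fill: clean bytes are simply never driven onto the memory bus.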
FAQ

•Coherency
  •Envisioned as the only/last-level cache
  •Coherency itself is future work
•Replay loops/problems
  •Avoided by random replacement + associativity
•Power expected to be not great…
22
Conclusions

•TputCache: an alternative to shared BRAM
  •Low overhead (13%-35% extra BRAM)
  •Nearly as high Fmax (253MHz vs. 270MHz)
•More flexible than shared BRAM
  •Performance degrades gradually
  •Cache behavior instead of manual filling
23
Questions?

Thank you
24