Accelerator-Rich Architectures: ARC, CHARM, BiN

[Figure: tiled accelerator-rich CMP, an array of tiles connected by routers. Legend: C = Core, A = Accelerator, $2 = L2 Bank, ABM = Accelerator & BiN Manager.]
CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat,
Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
1
Accelerator-Rich Architectures: ARC, CHARM, BiN
[Figure: the tiled accelerator-rich CMP again. Legend: C = Core, A = Accelerator, $2 = L2 Bank, ABM = Accelerator & BiN Manager; tiles connected by routers.]
2
Goals
 Implement the architecture features and support into the prototype system
 Architecture proposals
• Accelerator-rich CMPs (ARC)
• CHARM
• Hybrid cache
• Buffer-in-NUCA (BiN), etc.
 Bridge different thrusts in CDSC
3
Server-Class Platform: HC-1ex Architecture
 4× XC6vlx760 FPGAs + Xeon Quad Core LV5408
• 80GB/s off-chip bandwidth
• 40W TDP, 90W design power
 Tesla C1060 (GPU comparison point)
• 100GB/s off-chip bandwidth
• 200W TDP
4
Drawback of the Commodity Systems
 Limited ability to customize from the architecture point of view
 Board-level integration rather than chip-level integration
 Commodity systems can only reach a certain level; we need further innovations
5
CHP Prototyping Plan
 Create the working hardware and software
 Use FPGA Extensible Processing Platform (EPP) as the platform
• Reuse existing FPGA IPs as much as possible
 Work in multiple phases
6
Target Platforms: Xilinx ML605 and Zynq
 Xilinx ML605: Virtex6-based board
 Zynq: dual-core Cortex-A9 with programmable logic
7
CHP Prototyping Phases
 ARC Implementation
 Phase 1: Basic platform
• Accelerator and software GAM
 Phase 2: Adding modularity using available IP
• E.g. Xilinx DMAC IP
 Phase 3: First step toward BiN
• Shared buffer
• Customized modules (e.g. DMA controller, plug-n-play accelerator)
 Phase 4: System enhancement
• Crossbar
• AXI implementation
 CHARM Implementation
8
ARC Phase 1 Goals
Setting up a basic environment
 Multi-core + simple accelerators + OS
• Understanding the system interactions in more detail
 Simple controller as GAM (global accelerator manager)
• Supports system-level sharing of multiple accelerators of the same type
9
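The GAM's sharing policy can be sketched as a free-list lookup per accelerator type: a core asks for an accelerator of a given type, and the GAM grants a free instance or reports that all are busy. This is an illustrative host-side model, not the Microblaze implementation; the names (`gam_reserve`, `acc_slot`) and the table contents are made up for the sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical GAM bookkeeping: which accelerators of each type are
 * free.  Two vecadd and two vecsub instances, as in the Phase-1 system. */
#define NUM_ACC 4

typedef struct {
    const char *type;   /* e.g. "vecadd" or "vecsub" */
    int         busy;   /* 0 = free, 1 = reserved    */
} acc_slot;

static acc_slot accs[NUM_ACC] = {
    {"vecadd", 0}, {"vecadd", 0}, {"vecsub", 0}, {"vecsub", 0}
};

/* Reserve a free accelerator of the requested type; -1 if none free. */
int gam_reserve(const char *type) {
    for (int i = 0; i < NUM_ACC; i++) {
        if (!accs[i].busy && strcmp(accs[i].type, type) == 0) {
            accs[i].busy = 1;
            return i;          /* handle returned to the requesting core */
        }
    }
    return -1;                 /* all accelerators of this type are busy */
}

/* Release an accelerator when the task signals done. */
void gam_free(int id) { accs[id].busy = 0; }
```

In the prototype the request and grant travel over mailboxes between the cores and the GAM Microblaze; the model above only captures the allocation decision.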
ARC Phase 1 Example System Diagram
[Figure: example system. Microblaze-0 (Linux with MMU) and Microblaze-1 as GAM (bare-metal, no MMU) exchange commands over mailboxes; two vecadd and two vecsub accelerators attach over FSL links; an AXI4 crossbar and an AXI4-Lite bus connect DDR3, timer, mutex, and UART.]
10
ARC Phase-2 Goals
 Implementing a system similar to the original ARC design
 GAM, accelerator, DMA controller, SPM
 Adding modularity using available IP
 E.g. Xilinx DMAC IP
11
ARC Phase-2 Architecture
12
ARC Phase-2 Performance and Power Results
Benchmarking kernel:

for i = 0...4096:
    y(i) = x(i) + x(i)^2 + x(i)^3 + ... + x(i)^9

Results:

Platform                                        Runtime (us)  Power (W)  EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz            1,746          2                          17,570X
2x Quad-core Intel Xeon CPU E5405 x64
  @ 2.00GHz, 1 FPU per core                              562         80                           1,365X
Dual-core Intel Xeon CPU 5150 x32
  @ 2.66GHz, 1 FPU per core                           10,061         65                              94X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU        852,163         72                               1X
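For reference, the benchmark kernel above written out in plain C. This is a software baseline for illustration (the prototype runs the kernel on the accelerator); the function name is ours.

```c
#include <assert.h>

#define N 4096

/* y(i) = x(i) + x(i)^2 + ... + x(i)^9, evaluated with a running power
 * so each element costs 8 extra multiplies and 8 adds. */
void poly_kernel(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        float term = x[i];      /* x^1 */
        float acc  = term;
        for (int p = 2; p <= 9; p++) {
            term *= x[i];       /* x^p */
            acc  += term;
        }
        y[i] = acc;
    }
}
```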
ARC Phase-2 Runtime Breakdown
[Figure: runtime breakdown timeline (0-700 us) across the Core, GAM, DMAC, and ACC rows. The core sends a reservation request; the GAM reserves the accelerator and passes the parameters; the Acc wrapper partitions the task; the DMAC wrapper requests pages 0-3 in turn, the core translates each page, and the DMAC transfers it; the accelerator computes on P0-P3 (an 11.91 us span is marked), signals done through the GAM, and is freed.]
ARC Phase-2 Area Breakdown
Slice Logic Utilization:
 Number of Slice Registers: 45,283 out of 301,440 (15%)
 Number of Slice LUTs: 40,749 out of 150,720 (27%)
• Number used as logic: 32,505 out of 150,720 (21%)
• Number used as memory: 5,248 out of 58,400 (8%)

Slice Logic Distribution:
 Number of occupied Slices: 17,621 out of 37,680 (46%)
 Number of LUT-FF pairs used: 54,323
• Number with an unused flip-flop: 14,617 out of 54,323 (26%)
• Number with an unused LUT: 13,574 out of 54,323 (24%)
• Number of fully used LUT-FF pairs: 26,132 out of 54,323 (48%)

[Figure: area-breakdown chart with segments for AXI, DRAM controller, DMAC wrapper, DMAC, AXILite, Ethernet DMA, Accelerator, Microblaze (Linux), and Microblaze (GAM).]
ARC Phase-3 Goals
 First step toward BiN:
 Shared buffer
 Designing our customized modules
 Customized DMA controller
• Handles batch TLB misses
 Plug-n-play accelerator design
• Making the interface general enough, at least for a class of accelerators
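The batch-TLB-miss idea can be sketched as follows: before starting a transfer, the DMA controller scans every page the transfer will touch and reports all missing translations to the core in one request, instead of faulting page by page. Everything below (`collect_misses`, `tlb_fill`, the TLB layout) is an illustrative model, not the actual RTL.

```c
#include <assert.h>

#define TLB_SIZE 8

/* One translation entry: virtual page -> physical page. */
typedef struct { unsigned vpn; unsigned ppn; int valid; } tlb_entry;
static tlb_entry tlb[TLB_SIZE];

/* Install a translation (stand-in for the core filling the TLB). */
void tlb_fill(int slot, unsigned vpn, unsigned ppn) {
    tlb[slot].vpn = vpn; tlb[slot].ppn = ppn; tlb[slot].valid = 1;
}

/* Scan the pages [vpn_start, vpn_start+npages) of a pending transfer
 * and collect every one that misses; the caller then sends a single
 * batched translation request to the core. Returns the miss count. */
int collect_misses(unsigned vpn_start, int npages, unsigned missing[]) {
    int n = 0;
    for (int i = 0; i < npages; i++) {
        unsigned vpn = vpn_start + i;
        int hit = 0;
        for (int j = 0; j < TLB_SIZE; j++)
            if (tlb[j].valid && tlb[j].vpn == vpn) { hit = 1; break; }
        if (!hit) missing[n++] = vpn;   /* goes into the one batch request */
    }
    return n;
}
```

Batching matters here because each round trip to the core (the on-demand TLB fill in the Phase-3 system) costs far more than the scan itself.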
ARC Phase-3 Architecture
A partial realization of the proposed accelerator-rich CMP on Xilinx ML605 (Virtex-6):
 Global accelerator manager (GAM) for accelerator sharing
 Shared on-chip buffers: many more accelerators than buffer bank resources
 Virtual addressing in the accelerators; accelerator virtualization
 Virtual-addressing DMA, with on-demand TLB filling from the core
 No network-on-chip, no buffer sharing with cache, no customized instructions in the core

[Figure: Phase-3 system diagram. A Microblaze core (with CoreGAM and CoreIOMMU mailboxes, INTC, and timer) and the GAM share an AXI bus with DRAM; Ethernet, MDM, UART, and mutex hang off AXILite. Four accelerators (ACC0-ACC3) behind ACC wrappers 0-3 reach four shared buffers (Buffer0-Buffer3, on AXI_B0-AXI_B3) through an IOMMU and DMAC0-DMAC3. Link types: bus master, bus slave, FSL, AXIStream.]
Performance and Power Results
Benchmarking kernel:

for i = 0...4096:
    y(i) = x(i) + x(i)^2 + x(i)^3 + ... + x(i)^9

Results:

Platform                                        Runtime (us)  Power (W)  EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz            1,802          2                       8,050,786X
2x Quad-core Intel Xeon CPU E5405 x64
  @ 2.00GHz, 1 FPU per core                              562         80                       2,069,261X
Dual-core Intel Xeon CPU 5150 x32
  @ 2.66GHz, 1 FPU per core                           10,061         65                           7,947X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU        852,163         72                               1X
Impact of Communication & Computation Overlapping
[Figure: two execution timelines (0-2200 us) contrasting pipelined communication & computation with the no-pipeline case. In both, the core sends a reservation request, the GAM reserves the accelerator and passes the parameters, the Acc wrapper partitions the task, the IOMMU requests pages 0-4 and then 5-9, the core translates them, and the DMAC transfers them while the accelerator computes P0-P7. When the DMA transfers overlap the computation, the task finishes about 19% earlier before the accelerator is freed.]
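The gain from overlapping can be seen in a back-of-the-envelope latency model. This is illustrative only: the real timelines above also include GAM reservation and page-translation overheads that the model ignores, and the function names are ours.

```c
#include <assert.h>

/* Toy model: n pages, each needing a DMA transfer of t_dma and a
 * computation of t_comp time units. */

/* Without pipelining, the two stages run back-to-back per page. */
long no_pipeline(int n, long t_dma, long t_comp) {
    return (long)n * (t_dma + t_comp);
}

/* With pipelining, the transfer of page i+1 hides behind the
 * computation of page i: one transfer fills the pipeline, then the
 * slower of the two stages paces each remaining page. */
long pipelined(int n, long t_dma, long t_comp) {
    long slower = t_dma > t_comp ? t_dma : t_comp;
    return t_dma + (long)n * slower;
}
```

With balanced stages (t_dma == t_comp) the model approaches a 2x gain for large n; with unbalanced stages the saving shrinks toward the cost of the hidden stage, which is one reason the measured improvement above is a more modest 19%.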
Overhead of Buffer Sharing: Bank Access Contention (1)
[Figure: two execution timelines (0-2200 us). Top: the 4 logical buffers are allocated to 4 separate buffer banks. Bottom: the 4 logical buffers are allocated to 1 buffer bank. The single-bank run finishes only 3.2% later.]
Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time.
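A rough model of why the overhead stays small: the masters' AXI transactions proceed concurrently, so mapping all four logical buffers to one physical bank only serializes the short bank-access portion of each transaction. The formula and the numbers below are illustrative, not measured from the prototype.

```c
#include <assert.h>

/* Relative slowdown of putting all logical buffers in one bank.
 * t_axi:  AXI transaction time per access (overlaps across masters)
 * t_bank: raw bank access time (serializes on a shared bank)
 * masters: number of masters contending for the bank */
double one_bank_overhead(double t_axi, double t_bank, int masters) {
    double separate = t_axi + t_bank;            /* bank accesses overlap   */
    double shared   = t_axi + masters * t_bank;  /* bank accesses serialize */
    return shared / separate - 1.0;
}
```

When t_axi dominates t_bank, the serialized term barely moves the total, matching the small percentages observed; if bank time dominated instead, sharing a bank would be costly.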
Overhead of Buffer Sharing: Bank Access Contention (2)
[Figure: the same 4-banks vs. 1-bank comparison for the overlapped timelines (0-2300 us). Allocating the 4 logical buffers to 1 buffer bank instead of 4 separate banks finishes only 2.7% later.]
Area Breakdown
Slice Logic Utilization:
 Number of Slice Registers: 105,969 out of 301,440 (35%)
 Number of Slice LUTs: 93,755 out of 150,720 (62%)
• Number used as logic: 80,410 out of 150,720 (53%)
• Number used as memory: 7,406 out of 58,400 (12%)

Slice Logic Distribution:
 Number of occupied Slices: 32,779 out of 37,680 (86%)
 Number of LUT-FF pairs used: 112,772
• Number with an unused flip-flop: 25,037 out of 112,772 (22%)
• Number with an unused LUT: 19,017 out of 112,772 (16%)
• Number of fully used LUT-FF pairs: 68,718 out of 112,772 (60%)

[Figure: area-breakdown chart with segments for AXI-DDR, IOMMU, DDR controller, Ethernet DMA, DMAC0-DMAC3, AXILite, Microblaze1 (GAM), Microblaze0 (Linux), the accelerator (sum of 10 SQRTs), the buffer selectors, and the buffer buses/controllers AXI-BUF0-3 with BUF0CTRL-BUF3CTRL.]
Phase-4 ARC Goals
 Finding bottlenecks and system enhancement
 Communication bottleneck
 Crossbar design instead of AXI bus
 Speed up the AXI non-burst implementation
Accelerator Memory System Design
 Data transfer between
• main memory
• shared buffer banks
 # of buffer banks can be large; want to keep the AXI bus size
 Hierarchical DMACs and buses, in addition to the previously proposed main AXI bus
 Now supports partial configuration
• Will not affect working LCAs
 Passed on-board test

[Figure: a crossbar connects the OC core, IOMMU, GAM, select-bit receiver, and hierarchical DMACs (DMAC1-DMAC3) to DDR and to AXI buses serving LCA1-LCA4 and buffer banks 1-4 ... 9.]
24
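One way to keep each AXI bus small as the bank count grows is two-level addressing: split the global bank id into (local bus, bank within that bus), so each bus serves only a fixed number of banks. The helper below is a hypothetical sketch of that routing; `BANKS_PER_BUS` and the names are illustrative choices, not the prototype's parameters.

```c
#include <assert.h>

/* With many buffer banks, a flat bus would need a port per bank.
 * Grouping banks under local buses keeps each bus at a fixed width. */
#define BANKS_PER_BUS 4

typedef struct { int bus; int bank; } bank_addr;

/* Map a global bank id to its local bus and position on that bus. */
bank_addr route_bank(int global_bank) {
    bank_addr a;
    a.bus  = global_bank / BANKS_PER_BUS;  /* which local AXI bus  */
    a.bank = global_bank % BANKS_PER_BUS;  /* bank within that bus */
    return a;
}
```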
Crossbar Results
25