Slide 1: CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
[Figure: tiled CMP layout. Legend: C = Core, A = Accelerator, $2 = L2 Bank, ABM = Accelerator & BiN Manager; tiles are connected by routers.]

Slide 2: Accelerator-Rich Architectures: ARC, CHARM, BiN
[Figure: the same tiled CMP layout and legend as Slide 1.]

Slide 3: Goals
• Implement the architecture features and support into the prototype system
• Architecture proposals:
  • Accelerator-rich CMPs
  • CHARM
  • Hybrid cache
  • Buffer-in-NUCA
  • etc.
• Bridge different thrusts in CDSC

Slide 4: Server-Class Platform: HC-1ex Architecture
• 4 Xilinx XC6VLX760 FPGAs + quad-core Intel Xeon LV5408
  • 80 GB/s off-chip bandwidth
  • 40 W TDP, 90 W design power
• For comparison, an NVIDIA Tesla C1060: 100 GB/s off-chip bandwidth, 200 W TDP

Slide 5: Drawbacks of Commodity Systems
• Limited ability to customize from the architecture point of view
• Board-level rather than chip-level integration
• Commodity systems can only reach a certain level; further innovation is needed

Slide 6: CHP Prototyping Plan
• Create the working hardware and software
• Use an FPGA Extensible Processing Platform (EPP) as the platform
  • Reuse existing FPGA IPs as much as possible
• Work in multiple phases

Slide 7: Target Platforms: Xilinx ML605 and Zynq
• ML605: Virtex-6-based board
• Zynq: dual-core Cortex-A9 with programmable logic

Slide 8: CHP Prototyping Phases
• ARC implementation
  • Phase 1: Basic platform (accelerator and software GAM)
  • Phase 2: Adding modularity using available IP (e.g., the Xilinx DMAC IP)
  • Phase 3: First step toward BiN: shared buffer and customized modules (e.g., DMA controller, plug-and-play accelerator)
  • Phase 4: System enhancement: crossbar, AXI implementation
• CHARM implementation

Slide 9: ARC Phase-1 Goals
• Set up a basic environment: multi-core + simple accelerators + OS
  • Understand the system interactions in more detail
• Simple controller as GAM (global accelerator manager)
  • Supports system-level sharing for multiple accelerators of the same type

Slide 10: ARC Phase-1 Example System Diagram
[Figure: MicroBlaze-0 (Linux, with MMU) and MicroBlaze-1 (GAM, bare-metal, no MMU) connected through mailboxes; vecadd and vecsub accelerators attached over FSL links; an AXI4 crossbar and an AXI4-Lite bus connect the cores to the timer, mutex, UART, and DDR3.]

Slide 11: ARC Phase-2 Goals
• Implement a system similar to the original ARC design
  • GAM, accelerator, DMA controller, SPM
• Add modularity using available IP (e.g., the Xilinx DMAC IP)

Slide 12: ARC Phase-2 Architecture
[Figure: Phase-2 system architecture.]

Slide 13: ARC Phase-2 Performance and Power Results
Benchmarking kernel: for i = 0 ... 4096: y(i) = x(i) + x(i)^2 + x(i)^3 + ... + x(i)^9

  Platform                                                   Runtime (us)  Power (W)  EDP (energy-delay product) gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                      1,746          2                         17,570X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU/core          562         80                          1,365X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU/core           10,061         65                             94X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   852,163         72                              1X

Slide 14: ARC Phase-2 Runtime Breakdown
[Figure: timeline (0 to about 700 us) of one accelerator invocation. The core sends a reservation request; the GAM reserves the accelerator and passes the parameters; the accelerator wrapper partitions the task; the DMAC wrapper requests pages 0 through 3 and the DMAC transfers each page as the core translates it; the accelerator computes; the GAM passes the done signal and frees the accelerator. An 11.91 us interval is marked on the timeline.]
Slide 15: ARC Phase-2 Area Breakdown
Slice logic utilization:
• Slice registers: 45,283 out of 301,440 (15%)
• Slice LUTs: 40,749 out of 150,720 (27%)
  • Used as logic: 32,505 out of 150,720 (21%)
  • Used as memory: 5,248 out of 58,400 (8%)
Logic distribution:
• Occupied slices: 17,621 out of 37,680 (46%)
• LUT/flip-flop pairs used: 54,323
  • With an unused flip-flop: 14,617 out of 54,323 (26%)
  • With an unused LUT: 13,574 out of 54,323 (24%)
  • Fully used LUT-FF pairs: 26,132 out of 54,323 (48%)
[Figure: area breakdown by module: MicroBlaze (Linux), MicroBlaze (GAM), DMAC, DMAC wrapper, accelerator, Ethernet DMA, AXI Ethernet, AXI and AXI-Lite interconnect, DRAM controller.]

Slide 16: ARC Phase-3 Goals
• First step toward BiN: shared buffer
• Design our customized modules
  • Customized DMA controller: handles batched TLB misses
  • Plug-and-play accelerator design: make the interface general enough, at least for a class of accelerators

Slide 17: ARC Phase-3 Architecture
A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6):
• Global accelerator manager (GAM) for accelerator sharing
• Shared on-chip buffers: many more accelerators than buffer-bank resources
• Virtual addressing in the accelerators; accelerator virtualization
• Virtual-addressing DMA, with on-demand TLB filling from the core
• Not yet included: network-on-chip, buffer sharing with the cache, customized instructions in the core
[Figure: system diagram. Two MicroBlaze cores (one running Linux, one acting as GAM) reach the GAM and IOMMU through mailboxes; four accelerators (ACC0 to ACC3) sit behind wrappers with per-accelerator DMACs (DMAC0 to DMAC3); the IOMMU and four shared buffer banks (Buffer0 to Buffer3) hang off dedicated buses (AXI_B0 to AXI_B3); DRAM, Ethernet, MDM, UART, mutex, interrupt controller, and timer sit on the AXI and AXI-Lite buses; FSL and AXI-Stream links connect masters and slaves.]

Slide 18: ARC Phase-3 Performance and Power Results
Benchmarking kernel: same as Phase 2 (for i = 0 ... 4096: y(i) = x(i) + x(i)^2 + ... + x(i)^9)

  Platform                                                   Runtime (us)  Power (W)  EDP (energy-delay product) gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                      1,802          2                      8,050,786X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU/core          562         80                      2,069,261X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU/core           10,061         65                          7,947X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   852,163         72                              1X

Slide 19: Impact of Communication & Computation Overlapping
[Figure: two timelines for the same task. With pipelined communication and computation, the IOMMU translates and the DMAC transfers pages 5 through 9 while the accelerator computes on pages 0 through 4; without pipelining, transfer and compute run back to back. Pipelining shortens total runtime by 19%.]
Slide 20: Overhead of Buffer Sharing: Bank Access Contention (1)
[Figure: two non-pipelined timelines: the 4 logical buffers allocated to 4 separate buffer banks, versus all 4 allocated to a single buffer bank.]
• Sharing a single bank adds only 3.2% to the runtime
• Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time

Slide 21: Overhead of Buffer Sharing: Bank Access Contention (2)
[Figure: the same comparison with pipelined communication and computation: 4 separate banks versus a single shared bank.]
• With pipelining, sharing a single bank adds 2.7% to the runtime

Slide 22: Area Breakdown
Slice logic utilization:
• Slice registers: 105,969 out of 301,440 (35%)
• Slice LUTs: 93,755 out of 150,720 (62%)
  • Used as logic: 80,410 out of 150,720 (53%)
  • Used as memory: 7,406 out of 58,400 (12%)
Logic distribution:
• Occupied slices: 32,779 out of 37,680 (86%)
• LUT/flip-flop pairs used: 112,772
  • With an unused flip-flop: 25,037 out of 112,772 (22%)
  • With an unused LUT: 19,017 out of 112,772 (16%)
  • Fully used LUT-FF pairs: 68,718 out of 112,772 (60%)
[Figure: area breakdown by module: MicroBlaze0 (Linux), MicroBlaze1 (GAM), IOMMU, DMAC0 to DMAC3, the accelerator (sum of 10 SQRTs), buffer controllers (BUF0CTRL to BUF3CTRL) and buffer selectors, buffer buses (AXI-BUF0 to AXI-BUF3), AXI-DDR, DDR controller, AXI-Lite, Ethernet, and Ethernet DMA.]

Slide 23: ARC Phase-4 Goals
• Find bottlenecks and enhance the system
• Communication bottleneck:
  • Crossbar design instead of the AXI bus
  • Speed up the non-burst AXI implementation

Slide 24: Accelerator Memory System Design
In addition to the previously proposed features:
• Crossbar connecting the core AXI buses to main memory and the shared buffer banks
  • The number of buffer banks can be large; we want to keep the AXI bus size fixed
• Hierarchical DMACs and buses for data transfer between the buffer banks and DDR
• Partial reconfiguration is now supported: reconfiguring one LCA will not affect working LCAs
• Passed on-board test
[Figure: loosely coupled accelerators (LCA1 to LCA4) reach buffer banks (bank 1 to bank 9) through a select-bit receiver and crossbar; hierarchical DMACs (DMAC1 to DMAC3), the IOMMU, and the GAM sit on the main AXI bus to DDR.]

Slide 25: Crossbar Results
[Figure: crossbar performance results.]