A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks
Harvard University

Beyond Homogeneous Parallelism
[Figure: General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), and Application-Specific Accelerators (ASIP, ASIC) compared on Energy Efficiency, Flexibility, Programmability, and Design Cost.]

Today’s SoC
[Figure: OMAP 4 SoC.]
[Figure: OMAP 4 SoC block diagram with ARM Cores, Audio DSP, Video DSP, Face, Imaging, GPU, DMA, USB, and SD blocks connected through the System Bus, Secondary Buses, and a Tertiary Bus.]
[Figure: Apple A7. CPU + L2$ + GPU: 39%; Other Blocks: 61%.]
[Figure: Harvard VLSI-ARCH Group SoC Tapeout.]
[Figure: abstract SoC with CPUs, GPU/DSP, Buses, Memory Interface, and many accelerators (Acc).]

Future Accelerator-Centric Architectures
[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, and a Sea of Fine-Grained Accelerators.]
• How to decompose an application into accelerators?
• How to rapidly design lots of accelerators?
• How to design and manage the shared resources?
Flexibility, Design Cost, Programmability

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
[Figure: Aladdin block diagram.
 Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW).
 Modeled: Accelerator Specific Datapath; Private L1/Scratchpad.
 Coupled with: Shared Memory/Interconnect Models.
 Outputs: Power/Area; Performance.]
• “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
• “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
Flexibility, Programmability, Design Cost

Future Accelerator-Centric Architecture
[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, and a Sea of Fine-Grained Accelerators.]
Aladdin can rapidly evaluate the large design space of accelerator-centric architectures.

Aladdin Overview
[Figure: Aladdin flow.
 Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG.
 Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG, driven by the Acc Design Parameters.
 Outputs: Performance, and Activity -> Power/Area Models -> Power/Area.
 DDDG: Dynamic Data Dependence Graph.]

From C to Design Space: C Code
    for(i=0; i<N; ++i)
      c[i] = a[i] + b[i];

From C to Design Space: IR Dynamic Trace
    0.  r0=0                 //i = 0
    1.  r4=load(r0 + r1)     //load a[i]
    2.  r5=load(r0 + r2)     //load b[i]
    3.  r6=r4 + r5
    4.  store(r0 + r3, r6)   //store c[i]
    5.  r0=r0 + 1            //++i
    6.  r4=load(r0 + r1)     //load a[i]
    7.  r5=load(r0 + r2)     //load b[i]
    8.  r6=r4 + r5
    9.  store(r0 + r3, r6)   //store c[i]
    10. r0=r0 + 1            //++i
    …

From C to Design Space: Initial DDDG
[Figure: DDDG built from the IR trace. Nodes carry the trace numbers: 0. i=0, 5. i++, 10. i++ for the loop index, and per iteration ld a, ld b, +, st c (nodes 1 to 4, 6 to 9, 11 to 14).]

From C to Design Space: Idealistic DDDG
[Figure: the initial DDDG and the idealistic DDDG over the same nodes 0 to 14 (i=0, i++, ld a, ld b, +, st c).]
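
The step from the IR dynamic trace to the initial and idealistic DDDGs can be sketched in Python as follows. This is a minimal illustration, not Aladdin's implementation: it assumes unit-latency nodes, tracks only register read-after-write dependences, and ignores memory dependences (assuming a, b, and c do not alias). The names Op, make_trace, build_dddg, remove_index_dependences, and critical_path are hypothetical.

```python
from collections import namedtuple

# One dynamic IR operation: destination register (or None), source registers, label.
Op = namedtuple("Op", "dest srcs label")

def make_trace(iterations):
    """Dynamic trace of c[i] = a[i] + b[i], numbered like the slide's trace."""
    trace = [Op("r0", (), "i=0")]
    for _ in range(iterations):
        trace += [
            Op("r4", ("r0", "r1"), "ld a"),
            Op("r5", ("r0", "r2"), "ld b"),
            Op("r6", ("r4", "r5"), "+"),
            Op(None, ("r0", "r3", "r6"), "st c"),
            Op("r0", ("r0",), "i++"),
        ]
    return trace[:-1]  # the slide figures stop before the final i++

def build_dddg(trace):
    """Initial DDDG as {node: set of predecessor nodes} (register RAW edges only)."""
    last_writer, preds = {}, {}
    for node, op in enumerate(trace):
        preds[node] = {last_writer[r] for r in op.srcs if r in last_writer}
        if op.dest is not None:
            last_writer[op.dest] = node
    return preds

def remove_index_dependences(trace, preds):
    """Idealistic variant: drop the edges between loop-index nodes (i=0 -> i++ -> i++)."""
    index_nodes = {n for n, op in enumerate(trace) if op.dest == "r0"}
    return {n: (p - index_nodes if n in index_nodes else set(p))
            for n, p in preds.items()}

def critical_path(preds):
    """Longest dependence chain in nodes; trace order is already topological."""
    depth = {}
    for node in sorted(preds):
        depth[node] = 1 + max((depth[p] for p in preds[node]), default=0)
    return max(depth.values())

trace = make_trace(3)
initial = build_dddg(trace)
idealistic = remove_index_dependences(trace, initial)
print(critical_path(initial), critical_path(idealistic))  # 6 4
```

With three iterations the initial graph has a six-node critical path because each iteration waits on the previous i++; once the index edges are dropped, every iteration can start at once and the path shortens to four nodes.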

From C to Design Space: Optimization Phase: C->IR->DDDG
• Include application-specific customization strategies.
• Node-Level:
  – Bit-width Analysis
  – Strength Reduction
  – Tree-height Reduction
• Loop-Level:
  – Remove dependences between loop index variables
• Memory Optimization:
  – Memory-to-Register Conversion
  – Store-Load Forwarding
  – Store Buffer
• Extensible
  – e.g. Model CAM accelerator by matching nodes in DDDG

From C to Design Space: One Design
Acc Design Parameters: Memory BW <= 2, 1 Adder
[Figure: idealistic DDDG for four iterations (nodes 0 to 19) next to its Resource Activity per Cycle, showing the MEM and + units in use when the graph is scheduled under these parameters.]

From C to Design Space: Another Design
Acc Design Parameters: Memory BW <= 4, 2 Adders
[Figure: the same idealistic DDDG next to its Resource Activity per Cycle under these parameters.]
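
How the two designs above turn the same idealistic DDDG into different schedules can be sketched with a small list scheduler in Python. This is an illustration under stated assumptions, not Aladdin's scheduler: every node takes one cycle, i=0 is free, i++ occupies an adder, loads and stores each occupy one unit of memory bandwidth, and ready nodes are issued in node-number order. The function names idealistic_dddg and schedule are hypothetical.

```python
def idealistic_dddg(iterations):
    """Nodes numbered as on the slides: iteration k owns nodes 5k..5k+4 =
    index (i=0 or i++), ld a, ld b, +, st c. Returns (kind, preds)."""
    kind, preds = {}, {}
    for k in range(iterations):
        idx, lda, ldb, add, st = range(5 * k, 5 * k + 5)
        kind[idx] = "FREE" if k == 0 else "ADD"  # assumption: i++ uses an adder
        kind[lda] = kind[ldb] = kind[st] = "MEM"
        kind[add] = "ADD"
        preds[idx] = set()           # index-to-index edges already removed
        preds[lda] = {idx}
        preds[ldb] = {idx}
        preds[add] = {lda, ldb}
        preds[st] = {idx, add}
    return kind, preds

def schedule(kind, preds, mem_bw, adders):
    """Greedy list scheduling with unit latency. Returns {node: issue cycle}."""
    if mem_bw < 1 or adders < 1:
        raise ValueError("each resource needs at least one unit")
    limits = {"MEM": mem_bw, "ADD": adders}
    issued, cycle = {}, 0
    while len(issued) < len(kind):
        used = {"MEM": 0, "ADD": 0}
        for node in sorted(kind):
            if node in issued:
                continue
            # Ready only if every predecessor issued in an earlier cycle.
            if not all(issued.get(p, cycle) < cycle for p in preds[node]):
                continue
            res = kind[node]
            if res == "FREE":
                issued[node] = cycle
            elif used[res] < limits[res]:
                issued[node] = cycle
                used[res] += 1
        cycle += 1
    return issued

kind, preds = idealistic_dddg(4)
for mem_bw, adders in [(2, 1), (4, 2)]:
    cycles = max(schedule(kind, preds, mem_bw, adders).values()) + 1
    print(mem_bw, adders, cycles)  # prints "2 1 8" then "4 2 5"
```

Under these assumptions the design with Memory BW <= 2 and 1 Adder needs 8 cycles for four iterations, and the design with Memory BW <= 4 and 2 Adders needs 5; the exact cycle counts on the slides may differ because the slide figures do not state their latency or priority rules.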

From C to Design Space: Realization Phase: DDDG->Estimates
• Constrain the DDDG with program and user-defined resource constraints
• Program Constraints
  – Control Dependence
  – Memory Ambiguation
• Resource Constraints
  – Loop-level Parallelism
  – Loop Pipelining
  – Memory Ports
  – # of FUs (e.g., adders, multipliers)

From C to Design Space: Power-Performance per Design
[Figure: Power vs. Cycle, one point per design: (Memory BW <= 4, 2 Adders) and (Memory BW <= 2, 1 Adder).]

From C to Design Space: Design Space of an Algorithm
[Figure: Power vs. Cycle across the designs of one algorithm.]

Aladdin Validation
[Figure: validation flow.
 Aladdin flow: C Code -> Aladdin -> Power/Area, Performance.
 Reference flow: C Code -> RTL Designer, or HLS C Tuning with Vivado HLS -> Verilog -> Design Compiler (Power/Area) and ModelSim (Activity, Performance).]
[Figures: validation results.]

Aladdin enables rapid design space exploration for accelerators.

Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
[Figure: SoC components paired with simulators.
 Big Cores: MARSSx86
 Small Cores: XIOSim
 Shared Resources: Cacti/Orion2
 GPU: GPGPU-Sim
 Memory Interface: DRAMSim2
 Sea of Fine-Grained Accelerators]

Modeling Accelerators in a SoC-like Environment
[Figure: Power (mW) vs. Time (Million Cycles) for block=16 and block=32, with memory contention; system diagrams of Acc, Core, Cache, and Memory.]

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
• Architectures with 1000s of accelerators will be radically different; new design tools are needed.
• Aladdin enables rapid design space exploration of future accelerator-centric platforms.
• You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin
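
The path from activity to a point on the Power vs. Cycle plots can be sketched in Python as follows. This is a rough illustration, not Aladdin's power model: the per-operation energies, per-unit leakage values, clock period, and cycle counts are all hypothetical placeholders, and the names Design and average_power_mw are invented for this sketch.

```python
from dataclasses import dataclass

# Hypothetical characterization data; these are NOT values from Aladdin.
ENERGY_PJ = {"MEM": 5.0, "ADD": 1.0}                 # dynamic energy per operation
LEAKAGE_MW = {"MEM_PORT": 0.05, "ADDER": 0.02}       # leakage per instantiated unit

@dataclass
class Design:
    mem_bw: int   # memory operations allowed per cycle
    adders: int   # number of adders
    cycles: int   # cycle count of the resource-constrained schedule (placeholder)

def average_power_mw(design, activity, clock_ns=1.0):
    """Average power from activity counts, e.g. {"MEM": 12, "ADD": 7}."""
    dynamic_pj = sum(ENERGY_PJ[op] * count for op, count in activity.items())
    runtime_ns = design.cycles * clock_ns
    dynamic_mw = dynamic_pj / runtime_ns              # pJ / ns equals mW
    leakage_mw = (design.mem_bw * LEAKAGE_MW["MEM_PORT"]
                  + design.adders * LEAKAGE_MW["ADDER"])
    return dynamic_mw + leakage_mw

# Four iterations of c[i] = a[i] + b[i]: 8 loads + 4 stores, 4 adds + 3 i++.
activity = {"MEM": 12, "ADD": 7}
designs = [Design(mem_bw=2, adders=1, cycles=8),
           Design(mem_bw=4, adders=2, cycles=5)]
for d in designs:
    print(d.mem_bw, d.adders, d.cycles, round(average_power_mw(d, activity), 3))
```

Both designs execute the same operations, so the faster one spends the same dynamic energy in fewer cycles and instantiates more units; under these placeholder numbers it lands at fewer cycles and higher power, which is the kind of trade-off the Power vs. Cycle plots show.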