SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)
Markus Willems, Product Manager, Synopsys

SoC Design
• What to do when the performance of your main processor is insufficient?
– Go multicore? Application mapping is difficult, and resource utilisation is unbalanced
– Add hardwired accelerators? Balanced, but inflexible

SoC Design
• What to do when the performance of your main processor is insufficient?
• ASIPs: application-specific processors
– Anything between a general-purpose processor and a hardwired datapath
– Deploy classic hardware tricks (parallelism and customized datapaths) while retaining programmability
– Hardware efficiency with software programmability

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Architectural Optimization Space
• The ASIP architectural optimization space spans two dimensions: parallelism and specialization

Architectural Optimization Space: Parallelism
• Instruction-level parallelism (ILP)
– Orthogonal instruction set (VLIW)
– Encoded instruction set
• Data-level parallelism
– Vector processing (SIMD)
• Task-level parallelism
– Multicore
– Multithreading

Architectural Optimization Space: Specialization
• App.-specific data types: integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions: single or multi-cycle
• App.-specific data processing: any exotic operator
• App.-specific memory addressing: direct, indirect, post-modification, indexed, stack indirect…
• App.-specific control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Pipeline, connectivity & storage matching the application’s dataflow: distributed registers and sub-ranges; multiple memories and sub-ranges

IP Designer: ASIP Design and Programming

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Synopsys - Full Spectrum Processor Technology Provider

32-bit ARC HS Processors: High Performance for Embedded Applications
• Over 3100 DMIPS @ 1.6 GHz*
• HS family products
– HS34 CCM, HS36 CCM plus I&D cache
– HS234, HS236 dual-core
– HS434, HS436 quad-core
• Configurable, so each instance can be optimized for performance and power
• 53 mW* of power; 0.12 mm² area in a 28-nm process*
• Custom instructions enable integration of proprietary hardware
[Block diagram: ARCv2 ISA/DSP core with 10-stage pipeline, MAC & SIMD, multiplier, ALU, divider, late ALU; ARC Floating Point Unit, JTAG, user-defined extensions, optional instruction/data caches, instruction/data CCMs, memory protection unit, real-time trace]
*Worst-case 28-nm silicon and conditions

Pedestrian Detection and HOG
• Pedestrian detection
– Standard feature in luxury vehicles
– Moving to mid-size and compact vehicles in the next 5-10 years, also driven by legislative efforts
• Implementation requirements
– Low cost
– Low power (small form factor and/or battery powered)
– Programmable (to allow for in-field SW upgrades)
• The most popular algorithm for pedestrian detection is the Histogram of Oriented Gradients (HOG)

Histogram of Oriented Gradients
• Processing pipeline: grey scale conversion → scale to multiple resolutions → gradient computation → histogram computation per block → normalization of the histograms → SVM per window position → non-max suppression
• Use a fixed 64x128-pixel detection window
• Apply this detection window to scaled frames
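As a rough reference model of the first HOG stages (an illustrative sketch, not the ARC/ASIP implementation from the deck), the grey-scale conversion, Sobel gradient computation and per-cell orientation histograms can be written in Python. The function names, the luma coefficients, and the 9-bin histogram layout are my own assumptions; the slides only specify 8x8-pixel cells and Sobel operators.

```python
import numpy as np

def grey_scale(rgb):
    # Luma-style grey conversion (illustrative coefficients, not from the deck).
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def gradients(img):
    # Horizontal and vertical Sobel operators, as on the slides.
    gx_k = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    gy_k = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for y in range(1, h - 1):           # borders left at zero for simplicity
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx[y, x] = np.sum(patch * gx_k)
            gy[y, x] = np.sum(patch * gy_k)
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    return mag, ang

def cell_histograms(mag, ang, cell=8, bins=9):
    # One magnitude-weighted orientation histogram per 8x8-pixel cell.
    h, w = mag.shape
    hists = np.zeros((h // cell, w // cell, bins))
    bin_w = 180.0 / bins
    for cy in range(h // cell):
        for cx in range(w // cell):
            m = mag[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell].ravel()
            a = ang[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell].ravel()
            idx = np.minimum((a // bin_w).astype(int), bins - 1)
            np.add.at(hists[cy, cx], idx, m)        # unbuffered accumulation
    return hists
```

The Gaussian block weighting and the grouping of 2x2 cells into blocks mentioned on the histogram slide are omitted here to keep the sketch short.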
Gradient Computation
• Apply Sobel operators:

      +1  0  -1           +1  +2  +1
      +2  0  -2    and     0   0   0
      +1  0  -1           -1  -2  -1

Histogram Computation
• The image is divided into 8x8-pixel cells
• For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients

Normalization of the Histograms
• (1) L2 normalization
• (2) Clipping (saturation)
• (3) L2 normalization

Support Vector Machine
• Linear classification of histograms for every 64x128 window position

Non-Max Suppression
• Cluster the multi-scale dense scan of detection windows and select unique detections

HOG Functional Validation on ARC HS (640 x 480 pixels)
[Diagram: ARC HS subsystem (ctrl, DCCM, DMA, Sync & I/O) connected via a dedicated streaming interconnect (FIFOs) to ASIP1 … ASIPn, plus an AXI local interconnect]
• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed-point profiling results: 2.4 G cycles per frame

Profiling (640 x 480 pixels, at 30 FPS)

Function                         ARC HS G cycles   %       # ARC HS equivalent
Grey scale conversion            0.1               0.2%    0.07
Scale to multiple resolutions    1.6               2.3%    1.0
Gradient computation             17.3              26%     10.8
Histogram computation per block  31.9              47%     20.0
Normalization of the histograms  1.2               1.8%    0.8
SVM per window position          15.7              23%     9.8
Non-max suppression              0.004             0.01%   0.002

Task Assignment #2
[Diagram: HOG pipeline stages (grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression) mapped onto the HS subsystem (ctrl, DCCM, DMA, Sync & I/O, L3 ext. DRAM) and ASIP1, ASIP2 and ASIP4 via the dedicated streaming interconnect (FIFOs) and AXI local interconnect]

ASIP Example: HISTOGRAM
• Vector slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms
• 4x size increase & 200x speedup (relative to the RISC template)
• Implemented in less than 1 week

Task Assignment #3
[Diagram: as above, now with ASIP1, ASIP2, ASIP3 and ASIP4]

Task Assignment #4
[Diagram: as above, with ASIP1’ replacing ASIP1]

Task Assignment #4’
[Diagram: as above, with L2 SRAM added next to the HS DCCM]

Comparison

#  Platform      #HS (MHz)   #ASIP (MHz)   ARC functions                                                  ASIP functions
1  HS            ~40         0             All                                                            None
2  HS + ASIPs    2 (1600)    2.5 (500)     Greyscale, Rescaling, Normalization, Non-max suppr., Display   Gradient, Histogram, SVM
3  HS + ASIPs    1 (1600)    3.5 (500)     Greyscale, Rescaling, Non-max suppr., Display                  Gradient, Histogram, Normalization, SVM
4  HS + ASIPs    1 (500)     4 (500)       Greyscale, Non-max suppr., Display                             Rescaling, Gradient, Histogram, Normalization, SVM

Final Results
• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to the OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130 mW
• Scaling due to multicore, specialization and SIMD usage
• Power/performance/area via ASIPs
• Performance gains and power efficiency due to tailored instruction sets and a dedicated memory architecture

Scenario: Need for a Flexible FEC Core
• Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi
• Instead of duplicating FEC cores, a reconfigurable architecture at minimum power and area is needed
[Diagram: per-standard cores (LDPC-A, LDPC-C, LDPC-D, Turbo-A, Turbo-B, Viterbi) for DVB-X?, .11n LDPC, .11n Viterbi, UMTS turbo, 3GPP-LTE and .16e turbo, replaced by a single FlexFEC (turbo/LDPC/Viterbi) core]

Architecture Refinement to Increase Throughput: ILP Increased from 2 to 6
• ILP 2: 2 FUs (scalar + vector unit)
• ILP 6: 6 FUs (1 scalar + 5 vector units)
• No duplication of arithmetic functionality
• ILP exploited to increase throughput
• 2 FUs for local memory access

Fast Area/Performance Trade-off (40nm logical synthesis, processor only)
[Chart: cycle count vs. total number of processor functional units (2 to 6) for the ldpc layer 6, ldpc layer 8, turbo beta and turbo output kernels; processor areas 0.189 mm² and 0.177 mm²]

Architectural Exploration: FU Utilization, 2 → 5
• Vector slot separated into different FUs without overlapping functionality
• Local memory access congestion observed
[Charts: per-kernel FU utilization (%) for ldpc layer6/7/8 and turbo alpha/beta/output, with 2 FUs (scalar, vector) and with 5 FUs (scalar, vector alu, vector spec, vector vmem, vector bg vmem)]

Architectural Exploration: More Balanced FU Utilization, 5 → 6
[Chart: per-kernel FU utilization (%) for scalar, vector alu, vector spec, vector vmem, vector vmem2 and vector bg vmem across ldpc layer6/7/8 and turbo alpha/beta/output]
• Highly efficient C compilation: the vast majority of the 6 FUs is used

Latest IP Available from IMEC: Blox-LDPC ASIP
• Instances available

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Conclusion
• ASIPs enable programmable accelerators
• IP Designer enables efficient design and programming of ASIPs
• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
• ASIPs enable balanced multicore SoC architectures
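As a closing illustration of the HOG example discussed earlier, the per-block normalization (L2 normalize, clip, L2 normalize again) and the linear SVM stage can be sketched as follows. This is a reference model, not the ASIP code; the clipping threshold 0.2, the epsilon, and the weights in the test are illustrative assumptions, not values from the design.

```python
import numpy as np

def normalize_block(v, clip=0.2, eps=1e-6):
    # (1) L2 normalization, (2) clipping (saturation), (3) L2 normalization,
    # as listed on the "Normalization of the Histograms" slide.
    v = v / (np.linalg.norm(v) + eps)
    v = np.minimum(v, clip)
    return v / (np.linalg.norm(v) + eps)

def svm_score(descriptor, weights, bias):
    # Linear classification of one 64x128 window descriptor: w·x + b.
    return float(np.dot(weights, descriptor) + bias)
```

A window descriptor would be the concatenation of all normalized block histograms in the 64x128 window; a positive score marks a pedestrian detection at that window position, after which non-max suppression keeps one detection per cluster.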