ECE 260C – VLSI Advanced Topics
Term paper presentation
Low Power Processor Architectures and Software Optimization Techniques
May 27, 2014
Keyuan Huang, Ngoc Luong

Motivation
• Global mobile devices and connections growth: ~10 billion mobile devices in 2018
• Moore's law is slowing down
• Power dissipation per gate remains unchanged
• How to reduce power?
  - Circuit level optimizations (DVFS, power gating, clock gating)
  - Microarchitecture optimization techniques
  - Compiler optimization techniques
• Trend: more innovation in architectural and software techniques to optimize power consumption

Low Power Architectures Overview
• Asynchronous Processors
  - Eliminate the clock and use a handshake protocol
  - Save clock power at the cost of higher area
  - Ex: SNAP, ARM996HS, SUN Sproull
• Application Specific Instruction Set Processors (ASIP)
  - Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
  - Combine basic instructions with custom instructions based on the application
  - Ex: Tensilica's Xtensa, Altera's NIOS, Xilinx Microblaze, Sony's Cell, IRAM, Intel's EXOCHI
• Reconfigurable Instruction Set Processors
  - Combine a fixed core with reconfigurable logic (FPGA)
  - Low NRE cost vs ASIP
  - Ex: Chimaera, GARP, PRISC, Warp, Tensilica's Stenos, OptimoDE, PICO
• No Instruction Set Computer
  - Build a custom datapath based on the application code
  - Compiler has low-level control of hardware resources
  - Ex: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of computers 4.10 (2009).

Conservation Cores (c-cores)
• Combine a GP processor with an ASIP, focusing on reducing energy and energy-delay for a range of applications
• Broader range of applications compared to an accelerator
• Reconfigurable via a patching algorithm
• Automatically synthesizable by a toolchain from C source code
• Energy consumption is reduced by up to 16x for functions and 2.1x for the whole application
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

C-core organization
• Data path (FU, mux, register)
• Control unit (state machine)
• Cache interface (ld, st)
• Scan chain (CPU interface)

C-core execution
• The compiler inserts stubs into code that is compatible with a c-core
• The stub chooses between the c-core and the CPU: use the c-core if one is available, otherwise the GP processor
• The c-core raises an exception when it finishes executing and returns the value to the CPU

Patching support
• Basic block mapping
• Control flow mapping
• Register mapping
• Patch generation

Patching Example
• Configurable constants
• Generalized single-cycle datapath operators
• Control flow changes

Results
• 18 fully placed-and-routed c-cores vs MIPS
• 3.3x to 16x energy efficiency improvement
• Reduce system energy consumption by up to 47%
• Reduce energy-delay by up to 55% at the full application level
• Even higher energy saving without patching support
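The c-core execution flow above (a compiler-inserted stub that uses the c-core when one is available and falls back to the GP processor otherwise) can be sketched roughly in C. This is only an illustration under stated assumptions: `ccore_t`, `sum_to_n_stub`, and the two counters are hypothetical names, the "hardware" path is an ordinary C function standing in for the c-core, and the completion exception is modelled as a plain function return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for one c-core; these names are illustrative
 * and do not come from the conservation-cores paper. */
typedef int (*kernel_fn)(int);

typedef struct {
    bool available;    /* assumed flag: a matching c-core exists and is free */
    kernel_fn run_hw;  /* stands in for starting the c-core and waiting for
                          its completion exception */
} ccore_t;

/* Counters so a caller can observe which path was taken. */
static int hw_calls = 0;
static int sw_calls = 0;

/* Software version of the function, run on the general-purpose CPU. */
static int sum_to_n_cpu(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++) {
        s += i;
    }
    sw_calls++;
    return s;
}

/* Stand-in for the c-core version; here it is plain C that computes
 * the same result as the CPU version. */
static int sum_to_n_hw(int n) {
    hw_calls++;
    return n * (n + 1) / 2;
}

/* Stub of the kind a compiler could insert at the call site:
 * use the c-core if one is available, otherwise the CPU. */
static int sum_to_n_stub(const ccore_t *core, int n) {
    assert(n >= 0);
    if (core != NULL && core->available && core->run_hw != NULL) {
        return core->run_hw(n);  /* value handed back to the CPU */
    }
    return sum_to_n_cpu(n);
}
```

A caller would build a `ccore_t` with `available` set and `run_hw` pointing at `sum_to_n_hw`, then call `sum_to_n_stub(&core, 10)`; clearing `available` makes the same call run `sum_to_n_cpu` instead, with the same result.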
Software Optimization Technique
• Memory system: accounts for 1/10 to 1/4 of power in portable computers
• System bus: switching activity is controlled by software
• ALU and FPU data paths: need good scheduling to avoid pipeline stalls
• Control logic and clock: reduce power by using the shortest possible program to do the computation
K. Roy and M. C. Johnson, Software design for low power, 1996: NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics

General categories of software optimization
• Minimizing memory accesses
  - Minimize accesses needed by the algorithm
  - Minimize total memory size needed by the algorithm
  - Use multiple-word parallel loads, not single-word loads
• Optimal selection and sequencing of machine instructions
  - Instruction packing
  - Minimizing circuit state effect
  - Operand swapping

Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran, Michael Chu, Scott Mahlke

Basic Idea: Compiler Managed, Hardware Assisted
Hardware
• Banking, dynamic voltage/frequency scaling, dynamic resizing
  + Transparent to the user
  + Handle arbitrary instr/data accesses
  - Limited program information
Software
• Software controlled scratch-pad, data/code reorganization
  + Whole program information
  + Proactive
  - Conservative
Compiler managed, hardware assisted
• Global program knowledge
• Proactive optimizations
• Efficient execution
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)

Traditional Cache Architecture
[Figure: 4-way set-associative cache. The address is split into tag, set, and offset; each way has a tag array, a data array, and LRU state; four tag comparators feed the replacement logic and a 4:1 mux.]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways
Disadvantages
  - Fixed replacement policy
  - Set index does not reflect program locality
  - Set-associativity has high overhead
  - Activate multiple data/tag-arrays per access

Partitioned Cache Architecture
[Figure: partitioned cache with partitions P0 to P3, each with tag array, data array, and LRU state. Each ld/st carries Reg [Addr], a k-bit partition vector, and an R/U bit; four tag comparators feed the replacement logic and a 4:1 mux.]
Advantages
  + Improve performance by controlling replacement
  + Reduce cache access power by restricting the number of accesses
• Lookup: restricted to the partitions specified in the bit-vector if 'R', else defaults to all partitions
• Replacement: restricted to the partitions specified in the bit-vector

Partitioned Caches: Example
(a) Annotated code segment

    for (i = 0; i < N1; i++) {
        …
        for (j = 0; j < N2; j++)
            y[i + j] += *w1++ + x[i + j];
        for (k = 0; k < N3; k++)
            y[i + k] += *w2++ + x[i + k];
    }

(b) Fused load/store instructions: ld1/st1 and ld2/st2 access y; ld3 and ld4 access x; ld5 and ld6 access w1/w2
(c) Trace consisting of array references, cache blocks, and load/stores from the example
(d) Actual cache partition assignment for each instruction
  - part-0: ld1, st1, ld2, st2 (y), encoded as ld1 [100], R
  - part-1: ld5, ld6 (w1/w2), encoded as ld5 [010], R
  - part-2: ld3, ld4 (x), encoded as ld3 [001], R

Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics
  - Cache requirements: number of partitions per ld/st
  - Predict conflicts
• Place loads/stores into different partitions
  - Satisfy each instruction's caching needs
  - Avoid conflicts, overlap if possible

Cache Analysis: Estimating Number of Partitions
• Minimal partitions to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the reuse distance to compute the number of partitions
[Figure: reference trace. j-loop: X W1 Y Y X W1 Y Y; k-loop: X W2 Y Y X W2 Y Y. Example block trace: B1 M B1 M B2 M B2 M.]
• M has reuse distance = 1

Cache Analysis: Estimating Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimates hit-rate based on
  - Reuse-distance (D), total number of cache
    blocks (B), associativity (A) (Brehob et al., '99)
[Tables: estimated hit-rate matrices for D = 0, D = 1, and D = 2, indexed by number of cache blocks (8, 16, 24, 32) and associativity (1 to 4); the surviving entries range from .76 to 1.]
• Compute energy matrices in reality
• Pick the most energy-efficient configuration per instruction

Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the trace X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y relabelled as memory references M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1, and the resulting interference graph over M1, M2, M3, M4, each with D = 1.]

Partition Assignment
• Placement phase can overlap references
• Compute the combined working-set
• Use the graph-theoretic notion of a clique
• For each clique: new D = Σ D of each node
• Combined D for all overlaps = Max(all cliques)
Example (M1, M2, M3, M4 each have D = 1)
  - Clique 1: M1, M2, M4; new reuse distance (D) = 3
  - Clique 2: M1, M3, M4; new reuse distance (D) = 3
  - Combined reuse distance: Max(3, 3) = 3
Actual cache partition assignment for each instruction
  - part-0: ld1, st1, ld2, st2 (y), encoded as ld1 [100], R
  - part-1: ld5, ld6 (w1/w2), encoded as ld5 [010], R
  - part-2: ld3, ld4 (x), encoded as ld3 [001], R

Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations
  - 1-Kb to 32-Kb
  - 32-byte block size
  - 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
• Mediabench suite
• CACTI for cache energy modeling

Reduction in Tag & Data-Array Checks
[Figure: average way accesses (y-axis, 0 to 8) vs. cache size (x-axis, 1-K to 32-K, plus average) for 2-part, 4-part, and 8-part caches.]
• 25%, 30%, 36% access reduction on a 2-, 4-, 8-partition cache

Improvement in Fetch Energy
• 16-Kb cache: 2-part vs 2-way, 4-part vs 4-way, 8-part vs 8-way
[Figure: percentage energy improvement (y-axis, 0 to 60) per benchmark: rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg, and average.]
• 8%, 16%, 25% energy reduction on a 2-, 4-, 8-partition cache

Summary
• Maintain the advantages of a hardware cache
• Expose placement and lookup decisions to the compiler
  - Avoid conflicts, eliminate redundancies
• Achieve higher performance and lower power consumption

Future Work
• Hybrid scratch-pad and caches
• Develop an advanced toolchain for newer technology nodes such as 28nm
• Incorporate data cache partitioning into the compiler of the ASIP toolchain

References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
4. K. Roy and M. C. Johnson, Software design for low power, 1996: NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics
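
Backup: the clique rule from the Partition Assignment slide (new D of a clique = sum of the D of its nodes; combined D = max over all cliques) can be sketched in C roughly as follows. This is a minimal illustration, not the compiler's actual implementation: `clique_t`, `clique_reuse_distance`, and `combined_reuse_distance` are hypothetical names, and the cliques are assumed to be given as input rather than discovered from the interference graph.

```c
#include <assert.h>
#include <stddef.h>

/* One clique of the interference graph, stored as indices into the
 * per-reference reuse-distance array (hypothetical representation). */
typedef struct {
    const int *members;  /* indices of the references in this clique */
    size_t count;        /* number of references in this clique */
} clique_t;

/* New reuse distance of a clique: sum of the D of each node in it. */
static int clique_reuse_distance(const int *d, const clique_t *c) {
    int sum = 0;
    for (size_t i = 0; i < c->count; i++) {
        sum += d[c->members[i]];
    }
    return sum;
}

/* Combined reuse distance for all overlaps: maximum over all cliques.
 * Requires at least one clique. */
static int combined_reuse_distance(const int *d,
                                   const clique_t *cliques, size_t n) {
    assert(n > 0);
    int best = clique_reuse_distance(d, &cliques[0]);
    for (size_t i = 1; i < n; i++) {
        int v = clique_reuse_distance(d, &cliques[i]);
        if (v > best) {
            best = v;
        }
    }
    return best;
}
```

With the slide's data (M1 to M4 stored at indices 0 to 3, each with D = 1, and the cliques {M1, M2, M4} and {M1, M3, M4}), each clique sums to 3 and the combined reuse distance is Max(3, 3) = 3, matching the worked example.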