Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett Packard, Cupertino, California

University of Michigan, Electrical Engineering and Computer Science

Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches ~16% in StrongARM [Unsal et al., ’01]

Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instr/data accesses
– Limited program information
– Reactive

Software: software controlled scratch-pad, data/code reorganization
+ Whole program information
+ Proactive
– No dynamic adaptability
– Conservative

Reducing Data Memory Power: Compiler Managed, Hardware Assisted
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations

Data Caches: Tradeoffs
Advantages
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages
– Fixed replacement policy
– Set index has no knowledge of program locality
– Set-associativity has high overhead
– Activate multiple data/tag arrays per access

Traditional Cache Architecture
[Figure: 4-way set-associative cache. The address is split into tag, set, and offset; each way holds tag, data, and LRU state; four tag comparators (=?) feed a 4:1 mux; replacement selects among all ways.]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways

Partitioned Cache Architecture
[Figure: cache with four partitions P0 to P3, each holding tag, data, and LRU state. Each load/store (Ld/St Reg [Addr]) carries a k-bit partition bit-vector and an R/U bit; four tag comparators (=?) feed a 4:1 mux.]
• Lookup: restricted to the partitions specified in the bit-vector if ‘R’, else default to all partitions
• Replacement: restricted to the partitions specified in the bit-vector
• Advantages
– Improve performance by controlling replacement
– Reduce cache access power by restricting the number of accesses

Partitioned Caches: Example
for (i = 0; i < N1; i++) {
    …
    for (j = 0; j < N2; j++)
        y[i + j] += *w1++ + x[i + j];    /* ld1/st1, ld3, ld5 */
    for (k = 0; k < N3; k++)
        y[i + k] += *w2++ + x[i + k];    /* ld2/st2, ld4, ld6 */
}

Partition assignment (bit-vector, R/U bit):
ld1 [100], R    way-0: ld1, st1, ld2, st2 (y)
ld5 [010], R    way-1: ld5, ld6
ld3 [001], R    way-2: ld3, ld4
The w1/w2 and x references each get a way of their own.

• Reduce the number of tag checks per iteration from 12 to 4!
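The 12-to-4 reduction in the example above can be checked with a small model. This is only a sketch under stated assumptions: the `MemOp` class, the `tag_checks` function, and the string encoding of the bit-vector are hypothetical names chosen for illustration, and `st1` is assumed to carry the same bit-vector as `ld1` because the slide places both in way-0.

```python
from dataclasses import dataclass

NUM_WAYS = 3  # the example cache on the slide has way-0, way-1, way-2


@dataclass(frozen=True)
class MemOp:
    """Hypothetical model of an annotated load/store."""
    name: str
    bitvector: str    # one character per way, e.g. "100" selects way-0 only
    restricted: bool  # True models the 'R' bit, False models 'U'


def tag_checks(op, num_ways=NUM_WAYS):
    """Tag comparisons for one access: only the selected ways if 'R', else all ways."""
    if op.restricted:
        return op.bitvector.count("1")
    return num_ways


# One iteration of the j-loop: load y, load via w1, load x, store y.
# Bit-vectors for ld1, ld3, ld5 are taken from the slide; st1 is assumed to match ld1.
j_loop = [
    MemOp("ld1", "100", True),
    MemOp("ld3", "001", True),
    MemOp("ld5", "010", True),
    MemOp("st1", "100", True),
]

baseline = len(j_loop) * NUM_WAYS                   # every access probes every way
partitioned = sum(tag_checks(op) for op in j_loop)  # each access probes one way

print(baseline, partitioned)  # 12 4
```

An access marked 'U' falls back to probing all ways, so `tag_checks(MemOp("ldX", "100", False))` gives 3 in this model.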
Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application’s memory characteristics
– Cache requirements: number of partitions per ld/st
– Predict conflicts
• Place loads/stores into different partitions
– Satisfy their caching needs
– Avoid conflicts, overlap if possible

Cache Analysis: Estimating Number of Partitions
• Minimal partitions to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working-set to compute the number of partitions
[Figure: access trace X W1 Y Y, X W1 Y Y (j-loop) followed by X W2 Y Y, X W2 Y Y (k-loop), with each group annotated B1 and M.]
• M has working-set size = 1

Cache Analysis: Estimating Number Of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimate the hit-rate based on reuse-distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., ’99)
[Figure: hit-rate matrices for D = 0, D = 1, and D = 2, indexed by number of cache blocks (8, 16, 24, 32) and associativity (1 to 4); legible entries include .76, .87, .98, and 1.]
• In practice, compute energy matrices
• Pick the most energy efficient configuration per instruction

Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the access trace X W1 Y Y (j-loop) maps to references M4 M2 M1 M1, and X W2 Y Y (k-loop) maps to M4 M3 M1 M1; the interference graph has nodes M1, M2, M3, M4, each with D = 1.]

Partition Assignment
• Placement phase can overlap references
• Compute the combined working-set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ D of each node
• Combined D for all overlaps = Max (all cliques)
[Figure: interference graph with nodes M1, M2, M3, M4 (D = 1 each), showing Clique 1 and Clique 2.]
Clique 1: M1, M2, M4. New reuse distance (D) = 3
Clique 2: M1, M3, M4. New reuse distance (D) = 3
Combined reuse distance = Max(3, 3) = 3

Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
– 1-Kb to 32-Kb
– 32-byte block size
– 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
• Mediabench suite
• CACTI for cache energy modeling

Reduction in Tag & Data-Array Checks
[Figure: average way accesses (0 to 8) vs. cache size (1-K, 2-K, 4-K, 8-K, 16-K, 32-K, and average) for 2-part, 4-part, and 8-part caches.]
• 36% reduction on an 8-partition cache

Improvement in Fetch Energy
[Figure: percentage energy improvement (0 to 60) for a 16-Kb cache, per benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg, and average); series: 2-part vs 2-way, 4-part vs 4-way, 8-part vs 8-way.]

Summary
• Maintain the advantages of a hardware cache
• Expose placement and lookup decisions to the compiler
– Avoid conflicts, eliminate redundancies
• 24% energy savings for 4-Kb with 4 partitions
• Extensions
– Hybrid scratch-pad and caches
– Disable selected tags to convert them to scratch-pads
– 35% additional savings in a 4-Kb cache with 1 partition as SP

Thank You & Questions

Cache Analysis Step 1: Instruction Fusion
• Combine ld/st instructions that access the same set of objects
• Avoids coherence and duplication
• Points-to analysis
for (i = 0; i < N1; i++) {
    …
    for (j = 0; j < readInput1(); j++)
        y[i + j] += *w1++ + x[i + j];    /* ld1/st1, ld3, ld5 */
    for (k = 0; k < readInput2(); k++)
        y[i + k] += *w2++ + x[i + k];    /* ld2/st2, ld4, ld6 */
}
[Figure annotations: fused references M1 and M2 are marked on the code.]

Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required
• Compute the number of partitions for overlapped instructions
– Enumerate cliques within the interference graph
– Compute the combined working-set of all cliques
• Assign the R/U bit to control lookup
[Figure: interference graph with nodes M1, M2, M3, M4 (D = 1 each), showing Clique 1 and Clique 2.]

Related Work
• Direct addressed, cool caches [Unsal ’01, Asanovic ’01]
– Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers ’96]
– Hardware managed, two partitions
• Column partitioning [Devadas ’00]
– Individual ways can be configured as a scratch-pad
– No load/store based partitioning
• Region based caching [Tyson ’02]
– Heap, stack, globals
– Finer grained control and management
• Pseudo set-associative caches [Calder ’96, Inoue ’99, Albonesi ’99]
– Reduce tag check power
– Compromise on cycle time
– Orthogonal to our technique

Code Size Overhead
[Figure: percentage of instructions (0 to 12) per benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg, and average); series: annotated LD/STs and extra MOV instructions; two off-scale bars are labeled 15% and 16%.]
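The clique step from the Partition Assignment slides (sum D over each clique, then take the maximum over all cliques) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the edge set is an assumption inferred from the two cliques listed on the slide (M2 and M3 are taken not to interfere), and maximal cliques are enumerated with a plain Bron-Kerbosch recursion chosen here for brevity.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph (basic Bron-Kerbosch)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}   # v has been fully explored
            x = x | {v}   # exclude v from later cliques at this level
    yield from expand(set(), set(adj), set())


def combined_reuse_distance(adj, reuse_distance):
    """New D per clique is the sum of its nodes' D; the result is the max over cliques."""
    return max(sum(reuse_distance[n] for n in clique)
               for clique in maximal_cliques(adj))


# Interference graph assumed from the slide: Clique 1 = {M1, M2, M4}, Clique 2 = {M1, M3, M4}.
adj = {
    "M1": {"M2", "M3", "M4"},
    "M2": {"M1", "M4"},
    "M3": {"M1", "M4"},
    "M4": {"M1", "M2", "M3"},
}
reuse_distance = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}  # D = 1 for every node

print(combined_reuse_distance(adj, reuse_distance))  # 3
```

With D = 1 on every node, both cliques sum to 3, matching the Max(3, 3) = 3 shown on the slide.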