Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation Amin Ansari, Shuguang Feng, Shantanu Gupta, and Scott Mahlke University of Michigan, Ann Arbor HPCA-17 February 16, 2011 University of Michigan Electrical Engineering and Computer Science Matching Power Consumption and Utilization More than 50% of all computers Large SRAM structures limit the Min Vdd [Webber et. al.] More than 80% of times idle [Roth et. al.] DVS to improve battery life Logic cells can operate close to Vth Core i7 achieves 37% power reduction in idle state. 2 University of Michigan Electrical Engineering and Computer Science Bit-Error-Rate for an SRAM Cell Extremely fast growth in failure rate with decreasing Vdd 3 University of Michigan Electrical Engineering and Computer Science Our Goal Enabling DVS to push core’s Vdd down to o o Ultra low voltage region ( < 650mV ) While preserving correct functionality of on-chip caches Proposing a highly flexible and FT cache architecture that can efficiently tolerate these SRAM failures Minimizing our overheads in highpower mode 4 University of Michigan Electrical Engineering and Computer Science Archipelago (AP) data chunk Thisautonomous particular cache has By forming islands, a6 single AP only saves out offunctional 8 lines. line. 1 2 Island 1 3 4 Island 2 5 6 sacrificial line 7 8 sacrificial line University of Michigan Electrical Engineering and Computer Science Baseline AP Architecture Two lines have collision, if they have at least one faulty in Addedchunk modules: Fault map address Sacrificial line Memory Mapand 15 are collision + Memory the position (10 free) map Datasame line + Fault map There should be no collision between lines within a group G3 Input Address + MUXing layer [Group 3 (G3) contains lines 4, 10, and 15] First Bank Second Bank S Fault Map MUXing layer G3 - - Functional Block 6 Two type of lines: + data line + sacrificial line University of Michigan Electrical Engineering and Computer Science AP with Relaxed Group Formation Sacrificial lines do not contribute to the effective capacity o We want to minimize the total number of groups Second Bank First Bank S S First Bank Second Bank S 7 University of Michigan Electrical Engineering and Computer Science Semi-Sacrificial Lines First Bank Semi-sacrificial line guarantees the parallel access In contrast to a sacrificial line, it also contributes to the effective cache capacity Sacrificial line MUXing Layer Second Bank Semi-sacrificial line 8 University of Michigan Electrical Engineering and Computer Science AP with Semi-Sacrificial Lines Memory Map Input Address G3 First Bank Second Bank S semisacrificial line way0 way1 way0 way1 Fault Map MUXing layer G3 Functional Block 9 University of Michigan Electrical Engineering and Computer Science AP Configuration We model the problem as a graph: o o Each node is a line of the cache. Edge when there is no collision between nodes A collision free group forms a clique o Group formation Finding the cliques To maximize the number of functional lines, we need to minimize the number of groups. o minimum clique cover (MCC). 10 University of Michigan Electrical Engineering and Computer Science AP Configuration Example First Bank Second Bank 1 2 3 4 5 G1(1) G2(1) G2(S) G1(2) G2(2) 6 7 8 9 10 D G2(3) G1(3) G1(S) G2(4) 10 1 Island or Group 2 7 9 2 4 Island or Group 1 8 5 6 11 3 Disabled University of Michigan Electrical Engineering and Computer Science Operation Modes High power mode (AP is turned off) There is no non-functional lines in this case Clock gating to reduce dynamic power of SRAM structures Low power mode o During the boot time in low-power mode BIST scans cache for potential faulty cells Processor switches back to high power mode Forms groups and configure the HW 12 University of Michigan Electrical Engineering and Computer Science Evaluation Methodology Performance [DEC Alpha 21364] o SimAlpha that is based on SimpleScalar OoO [SPEC2K] Delay, power and area o o o o Wattch and hot-leakage for power of processor Artisan memory-compiler for our SRAM structures CACTI for baseline on-chip caches (64KB, 2MB) Synopsys design-compiler and power-compiler for Miscellaneous logic (e.g. bypass MUXes and comparators) Given set of cache parameters (e.g. Vdd) o o Monte Carlo (with 1000 iterations) using our modified MCC Determining disabled portion of caches (for 99% yield) 13 University of Michigan Electrical Engineering and Computer Science Minimum Achievable Vdd 14 University of Michigan Electrical Engineering and Computer Science Overheads Overheads for L1 and L2 caches o 10T used to protect the fault map, tag array, and memory map fault map (10T) miscellaneous logic memory map (10T) tag overhead (10T) 14 Percentage of Overhead 12 10 High Power Mode 8 6 4 2 0 L1 area L2 area L1 leakage power 15 L2 leakage power L1 dynamic power L2 dynamic power University of Michigan Electrical Engineering and Computer Science Performance Loss One extra cycle latency for L1 and 2 cycles for L2 16 University of Michigan Electrical Engineering and Computer Science Summary of Benefits Larger leakage power savings for deeper technology nodes 17 University of Michigan Electrical Engineering and Computer Science Comparison with Alternative Methods 100 10T Recently Proposed Cache Area Overhead (%) 66% area overhead Conventional ZC ECC-2 10 SECDED Row Red AP BF Disabled: 25% Disabled: 9% 1 0.5 1 1.5 2 2.5 3 Power (at minimum Vdd) Normalized to Archipelago 18 University of Michigan Electrical Engineering and Computer Science Conclusion DVS is widely used to deal with high power dissipation o We proposed a highly flexible cache architecture o Minimum achievable voltage is bounded by SRAM structures To tolerate failures when operating in near-threshold region Using our approach o o o Vdd of processor can be reduced to 375mV 79% dynamic power saving and 51% leakage power saving < 10% area overhead and performance overheads 19 University of Michigan Electrical Engineering and Computer Science Thank You http://cccp.eecs.umich.edu 20 University of Michigan Electrical Engineering and Computer Science