Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin

Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer

PDP2014, Turin, Italy
13 February 2014

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693.

Dark Silicon Phenomenon
The number of transistors can be increased.
In order to stay within a chip's power budget, some must remain "dark".
One solution: downscale the voltage.

Combining Error Detection and TM for Energy-Efficient Computing below Safe Operation Margin 2

How about Reliability?
[Figure labels: POWER, PERFORMANCE, RELIABILITY]
When Vdd is reduced, the error rate increases exponentially [1].
Our goal: investigate the limits of voltage reduction at which error recovery still leads to reduced energy consumption.
[1] Dan Ernst et al. "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation." In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 7-18, 2003.

Agenda / Overview
Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion

Reducing Vdd in a Real System
AMD FX-6100 6-core CPU
Errors are in the instruction cache (37%), the execution unit (61%) and others (less than 2%).
[Figure legend: cumulative number of failures, cumulative number of uncorrectable errors, cumulative number of signals, cumulative number of correctable errors; y-axis: # Errors]
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor the Machine Check Architecture (MCA)
The system encounters errors that cannot be corrected by MCA after only a 10% reduction in Vdd.
[Figure axes: # Errors (0 to 4500) versus supply voltage (0.9875 V to 1.1500 V, in 12.5 mV steps); annotations: Incorrect Result, System Crash]

Basics of Reliability
Error Recovery: Global Checkpointing, Coordinated Local Checkpointing, Un-coordinated Local Checkpointing
Error Detection: Replication, Assertions/Invariants, Symptom-Based, Encoded Processing
Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].
[2] Gulay Yalcin et al. "FaulTM: Fault Tolerance Using Hardware Transactional Memory." DATE 2013.

TM provides checkpointing/rollback
[Figure: processors P1, P2, P3, P4, ..., Pn, each with its own Checkpoint (Log Area); label: Synchronize checkpoints]
TM write-sets log the tentative memory updates.
Data-Versioning provides a synchronization mechanism between checkpoints.

Error Detection Schemes - Replication
Execute instruction streams multiple times
Compare the results of executions
Fewer comparisons with TM
Dual/Triple Modular Redundancy
+ High Error Detection Rate
- High Energy Overhead

Error Detection Schemes - Assertions/Invariants
Assertions: Conditions referring to the current and previous state of the program.
Check the state
Added manually or automatically
TM facilitates inserting invariants
Ex: [example not preserved in this text]

Error Detection Schemes - Symptoms
Monitor program executions to inspect whether there is a symptom of hardware faults.
Symptoms: mispredictions in high-confidence branches, high OS activity, fatal traps (e.g. undefined instruction code)
Reliability at a low cost

Error Detection Schemes - Encoded Processing
Apply software coding (ECC-like) techniques
The redundancy is added by applying arithmetic codes to the values.
Arithmetic codes: AN, ANBDmem etc.
With TM, the validation of a code word can be deferred until a TX commits.
Ex: [example not preserved in this text]

Comparing Error Detection Schemes
[Figure: bar chart of Energy Spent (y-axis, 0.0 to 4.0) for Base, DMR, TMR, Symptom-Based, Encoding, Invariants]

Analysis
Gem5 full system simulator
1GHz in-order cores
4 cores
X86 ISA
64KB L1 data and instruction caches
Unified 2MB L2 cache
SPLASH2 benchmark suite
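
Sketch for the Encoded Processing slide: the idea of deferring code-word validation until a TX commits can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used in the talk. It uses a plain AN code (not ANBDmem), the constant A and all function names are hypothetical, and a dictionary copy stands in for the TM write-set.

```python
# Minimal AN-code sketch with validation deferred to transaction commit.
# Assumptions (not from the slides): the constant A, all function names,
# and the use of a dict copy as a stand-in for a TM write-set.

A = 58659  # arbitrary odd constant > 1, chosen only for illustration


def encode(x):
    """Encode an integer as an AN code word (a multiple of A)."""
    return x * A


def is_valid(code_word):
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word):
    """Decode a valid code word; raise if the word is corrupted."""
    if not is_valid(code_word):
        raise ValueError("invalid AN code word")
    return code_word // A


def add_encoded(a, b):
    """The sum of two code words is again a code word: A*x + A*y = A*(x+y)."""
    return a + b


def run_transaction(body, state, max_retries=3):
    """Run body on a tentative copy of state and validate only at commit.

    If any value in the tentative copy is not a valid code word, the copy
    is discarded (rollback) and the body is re-executed.
    """
    for _ in range(max_retries):
        tentative = dict(state)  # stand-in for the TM write-set
        body(tentative)
        if all(is_valid(v) for v in tentative.values()):
            state.update(tentative)  # commit
            return True
        # validation failed: drop the tentative copy and retry
    return False
```

Because A is odd and greater than 1, a single bit flip changes a code word by a power of two, which is never a multiple of A, so this check catches every single-bit error in a stored value. Checking once per transaction instead of after every operation is the saving the slide refers to.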

Energy Analysis
[Diagram labels: Error Detection Rate, TX size, Fault Injection, Recovery Overhead, Vdd, Error-free Overhead]
$E \approx C \times V_{dd}^{2}$

Energy Reduction
TX_Size = 1000 instructions
[Figure: y-axis 0 to 4; x-axis Supply Voltage (V), from 2 down to 0.8; series: Base, SymptomBased, Encoding, Duplication, Invariants, Triplication]

Reliability of the System

Conclusion
The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.

Future Work: Combining DMR and Symptoms

Thanks!
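
Sketch for the Energy Analysis slide: the relation E ≈ C × Vdd² can be combined with the error-free overhead and the recovery overhead to see when voltage scaling still pays off. The way the terms are combined here is an assumption made for illustration, not the model from the talk. The function name, its parameters, and the retry-until-success recovery model are all hypothetical.

```python
# Rough relative-energy model built around E ~ C * Vdd^2.
# Assumptions (not from the slides): the multiplicative combination of
# terms, the geometric retry model, and every name used below.


def relative_energy(vdd, vdd_nominal, error_free_overhead, p_tx_error):
    """Energy relative to an unprotected run at nominal voltage.

    vdd, vdd_nominal    -- scaled and nominal supply voltage (same unit)
    error_free_overhead -- extra work for error detection when no error
                           occurs (0.0 = none, 1.0 = twice the work)
    p_tx_error          -- probability that one transaction attempt hits
                           a detected error and has to be re-executed
    """
    if not 0.0 <= p_tx_error < 1.0:
        raise ValueError("p_tx_error must be in [0, 1)")
    # Dynamic energy scales with the square of the supply voltage.
    voltage_factor = (vdd / vdd_nominal) ** 2
    # Each attempt fails independently, so the expected number of
    # attempts until one succeeds is 1 / (1 - p).
    expected_attempts = 1.0 / (1.0 - p_tx_error)
    return voltage_factor * (1.0 + error_free_overhead) * expected_attempts
```

Under these assumptions, lowering Vdd saves energy only while the result stays below 1.0. A scheme with a high error-free overhead, such as duplication with `error_free_overhead=1.0`, needs a much deeper voltage reduction to break even than a lightweight one, and the gain disappears once the per-transaction error probability grows faster than the quadratic voltage saving.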