presentation - ParaDIME project

Combining Error Detection and Transactional
Memory for Energy-Efficient Computing below
Safe Operation Margin
Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff,
Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer
PDP2014, Turin, Italy
13 February 2014
This project and the research leading to these results has received funding from the European Community's
Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693
Dark Silicon Phenomenon
The number of transistors on a chip can keep increasing.
To stay within the chip’s power budget, some transistors must remain “dark” (powered off).
One solution: scale down the supply voltage (Vdd).
How about Reliability?
[Figure: trade-off triangle of power, performance, and reliability]
When Vdd is reduced, the error rate increases exponentially [1].
Our goal:
Investigate how far the voltage can be reduced while, even with the cost of error recovery, the total energy consumption still goes down.
[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003.
Agenda / Overview
Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion
Reducing Vdd in a Real System
AMD FX-6100, 6-core CPU
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor: incorrect results, system crashes, Machine Check Architecture (MCA)
[Figure: cumulative number of failures, signals, correctable errors, and uncorrectable errors (y-axis: # Errors, 0 to 4500) versus supply voltage (x-axis: 0.9875 V to 1.1500 V)]
Errors are in the instruction cache (37%), the execution unit (61%), and others (less than 2%).
The system encounters errors that cannot be corrected by MCA after only a 10% reduction in Vdd.
Basics of Reliability
Error Recovery:
Global Checkpointing
Coordinated Local Checkpointing
Un-coordinated Local Checkpointing
Error Detection:
Replication
Assertions/Invariants
Symptom-Based
Encoded Processing
Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].
[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory.” DATE 2013.
TM provides checkpointing/rollback
[Figure: processors P1 to Pn, each with its own checkpoint (log area); the checkpoints are synchronized]
TM write-sets log the tentative memory updates.
Data versioning provides a synchronization mechanism between checkpoints.
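The idea of a write-set acting as a checkpoint can be illustrated with a toy software model. This is only a sketch under assumptions: the names `Transaction` and `run_with_recovery` are hypothetical, memory is modelled as a Python dict, and the error detector is passed in as a plain callback. It does not represent the hardware TM design discussed in the talk.

```python
class Transaction:
    """Toy transaction: buffers writes and applies them only on commit."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for main memory
        self.write_set = {}    # tentative updates (the "log area")

    def read(self, addr):
        # Prefer our own tentative value, then fall back to memory.
        return self.write_set.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        # Tentative update: memory itself is not modified yet.
        self.write_set[addr] = value

    def commit(self):
        # Make all tentative updates visible at once.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: drop the tentative updates; memory was never touched.
        self.write_set.clear()


def run_with_recovery(memory, body, error_detected, max_retries=3):
    """Run `body` in a transaction; abort and re-execute if an error is detected."""
    for _ in range(max_retries):
        tx = Transaction(memory)
        body(tx)
        if error_detected(tx):
            tx.abort()
            continue
        tx.commit()
        return True
    return False
```

Because the updates stay in the write-set until commit, recovery is simply an abort followed by re-execution; no separate checkpoint copy of memory is needed in this model.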
Error Detection Schemes - Replication
Execute instruction streams multiple times
Compare the results of executions
Fewer comparisons with TM.
Dual/Triple Modular Redundancy
+ High Error Detection Rate
- High Energy Overhead
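Dual modular redundancy can be sketched in software as follows. This is a minimal illustration, assuming a hypothetical helper `run_dmr` that executes a function twice and re-executes on disagreement; real DMR replicates instruction streams in hardware or across cores.

```python
def run_dmr(compute, *args, max_retries=3):
    """Execute `compute` twice and accept the result only if both runs agree."""
    for _ in range(max_retries):
        first = compute(*args)
        second = compute(*args)
        if first == second:
            return first
        # Mismatch: an error hit one of the replicas, so re-execute both.
    raise RuntimeError("replicas disagreed on every attempt")
```

The sketch also shows where the energy overhead comes from: every accepted result costs at least two executions.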
Error Detection Schemes - Assertions/Invariants
Assertions: conditions referring to the current and previous state of the program.
Check the state
Added manually or automatically
TM facilitates inserting invariants
Ex:
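A minimal sketch of an invariant checked before commit, assuming a hypothetical `transfer` function over a dict of account balances. The `inject_fault` parameter is also an assumption, included only to simulate a hardware error in the tentative state.

```python
def transfer(accounts, src, dst, amount, inject_fault=None):
    """Apply a transfer on a tentative copy; commit only if the invariant holds."""
    total_before = sum(accounts.values())  # previous state
    tentative = dict(accounts)             # tentative updates, not yet visible
    tentative[src] -= amount
    tentative[dst] += amount
    if inject_fault is not None:
        inject_fault(tentative)            # simulate a hardware error
    # Invariant: a transfer must not change the total balance.
    if sum(tentative.values()) != total_before:
        return False                       # abort: discard the tentative state
    accounts.update(tentative)             # commit
    return True
```

The invariant compares the current (tentative) state with the previous one, and a violation leaves the committed state untouched.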
Error Detection Schemes - Symptoms
Monitor program execution for symptoms of hardware faults.
Symptoms:
Mispredictions in high confidence branches,
high OS activity,
fatal traps (e.g. undefined instruction code)
Reliability at a low cost
Error Detection Schemes - Encoded Processing
Apply software coding (ECC-like) techniques
Redundancy is added by applying arithmetic codes to the values.
Arithmetic codes: AN, ANBDmem, etc.
With TM, the validation of a code word can be deferred until a TX commits.
Ex:
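A minimal sketch of an AN code, the simplest of the arithmetic codes named above. The constant `A` and the helper names are assumptions chosen for illustration. A value x is stored as A * x, so every valid code word is a multiple of A, and addition preserves that property.

```python
A = 58659  # illustrative constant (an assumption); any odd A > 1 exposes single bit flips


def encode(x):
    """Encode an integer as a multiple of A."""
    return A * x


def is_valid(code_word):
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word):
    """Validate, then recover the original value."""
    if not is_valid(code_word):
        raise ValueError("corrupted code word")
    return code_word // A


def encoded_add(c1, c2):
    """Addition works directly on code words: A*x + A*y == A*(x + y)."""
    return c1 + c2
```

In this sketch the `is_valid` check can be run once on the final result instead of after every operation, which mirrors the idea of deferring validation until a transaction commits.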
Comparing Error Detection Schemes
[Figure: bar chart of energy spent (y-axis: 0.0 to 4.0) for Base, DMR, TMR, Symptom-Based, Encoding, and Invariants]
Analysis
gem5 full-system simulator
1 GHz in-order cores
4 cores
x86 ISA
64 KB L1 data and instruction caches
Unified 2 MB L2 cache
SPLASH-2 benchmark suite
Energy Analysis
[Figure: energy analysis flow relating error detection rate, TX size, fault injection, recovery overhead, error-free overhead, and Vdd]
$E \approx C \cdot V_{dd}^{2}$
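The quadratic dependence of energy on Vdd can be made concrete with a small calculation. This is a sketch under assumptions: the function names are hypothetical, and all detection and recovery costs are lumped into a single multiplicative `overhead` factor, which is a simplification and not the model used in the analysis.

```python
def relative_energy(vdd, vdd_nominal, overhead=0.0):
    """Energy relative to nominal operation, using E ~ C * Vdd^2.

    `overhead` is the extra work for error detection and recovery,
    expressed as a fraction of the error-free work (an assumption).
    """
    return (vdd / vdd_nominal) ** 2 * (1.0 + overhead)


def break_even_overhead(vdd, vdd_nominal):
    """Largest overhead at which reduced-Vdd operation still saves energy."""
    return (vdd_nominal / vdd) ** 2 - 1.0
```

For example, `relative_energy(0.9, 1.0)` evaluates to about 0.81, so a 10% reduction in Vdd leaves room for roughly 23% of detection and recovery overhead before the energy saving disappears.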
Energy Reduction
TX_Size = 1000 instructions
[Figure: line chart (y-axis: 0 to 4) versus supply voltage (x-axis: 2 V down to 0.8 V) for Base, Symptom-Based, Encoding, Duplication, Invariants, and Triplication]
Reliability of the System
Conclusion
The energy consumption of CPUs can be
reduced if we have efficient hardware
support for Transactional Memory and for
Error Detection.
Future Work: Combining DMR and Symptoms
Thanks!