Outline
• Characterization of the Protection Overhead
• Resilience Optimization with Cross-Layer Techniques
• Cross-Layer Protection in an Accelerator
Soft-Error Problem
• Soft errors are caused by particle strikes (cosmic rays, alpha particles); most important in storage cells
• Most errors occur in sequential logic (flip-flops, SRAM); fewer in combinational logic (gates)
• Most errors are masked in the HW-Arch-SW hierarchy (electrical/temporal, logical, architectural, OS/software), yet many remain
[Figure: particle strike and resulting current pulse. Source: R. Baumann et al., IEEE DTC, 2005 [1]]
Soft-Error Problem
• Decreasing cell vulnerability:
  – Multi-gate transistor and Silicon-on-Insulator technologies (smaller sensitive area)
• Remaining concerns:
  – Lower voltage, smaller critical charge
  – More multi-bit errors
  – Higher altitude
  – Larger process variation
  – Increase with chip and computation complexity
[Figure: decreasing cell vulnerability. Source: N. Seifert et al., IEEE ToNS, 2012 [2]]
[Figure: particle strike causing multi-bit errors. Source: N. George et al., DSN, 2010 [3]]
Soft-Error Problem
• Hardware protection by circuit type:
  – Combinational: Redundancy*^, Arithmetic Codes (ALU/FPU)
  – Sequential: Parity*, ECC*^, Upsizing, Hardening
  – SRAM: Parity*, ECC*^
  [Circuit illustrations source: Wikipedia, 2014 [10]]
• Architecture- and software-level protection:
  – Instruction Replication^ / Redundant Multi-threading (RMT)^
  – Control/Data Flow Checking (CFC/DFC)
  – Algorithm-Based Fault Tolerance (ABFT)
  – Other SW and HW techniques, less efficient
* – commonly used; ^ – high overhead; bold – analyzed in this work
[Dissertation roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]
[Roadmap: Existing Protection Overhead Characterization (IEEE Micro, 2013) → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]
Characterization of the Protection Overhead
Motivation
– High overhead of redundancy, the most comprehensive solution
  • How do other techniques scale for better coverage?
– Important to understand the overhead breakdown
  • Determine the efficiency of common techniques
  • Illustrate the optimal designs currently possible
  • Serve as a baseline to facilitate further exploration
Contributions
– Develop resilience techniques for an example core and place them in the exploration framework
– Provide area, power and delay overhead analysis at the component and core levels
– No previous work provides such a comprehensive resilience analysis within a single core
Characterization of the Protection Overhead

Considered Protection Techniques
• More common: Parity, ECC, Redundancy
• Less common: Hardening, Arithmetic Codes (parity prediction, residue codes), Instruction Replication

Platform Choice
• Case Study: OpenRISC [4]
  – Throughput and mobile designs
  – Scalar, In-Order (IO)
  – 5-stage pipeline
  – 4kB instruction & data caches
  – Around 1,300 flip-flops

Protection Characterization
• Implementation: developed RTL code; literature estimates (hardening)
• Analysis: area, power and delay via synthesis; integrated into the Svalinn framework; estimation for other cores
Characterization of the Protection Overhead
Hardening
• Improves flip-flop immunity by adding redundant transistors that hold the state on an upset or cancel its effect
Codes (parity, ECC, arithmetic)
• Parity – adds a single check bit to a bit word to verify correctness on access (see the sketch after this list)
• ECC – adds multiple check bits to a bit word to detect 2 errors and correct 1
• Arithmetic – parity-prediction and residue codes that predict output properties from the inputs
Redundancy
• Replicates a component and compares results from the two copies to check for errors
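A minimal sketch of the parity scheme above (Python; function names are illustrative, not from the dissertation's RTL): a single even-parity bit detects any odd number of flipped bits in the stored word.

```python
def parity_bit(word: int) -> int:
    """Even parity: XOR-reduce all bits of the word to one check bit."""
    bit = 0
    while word:
        bit ^= word & 1
        word >>= 1
    return bit

def store(word: int) -> tuple[int, int]:
    """Write path: keep the check bit alongside the data word."""
    return word, parity_bit(word)

def load_ok(word: int, check: int) -> bool:
    """Read path: recompute parity; a mismatch flags an upset."""
    return parity_bit(word) == check

w, c = store(0b10110010)
assert load_ok(w, c)                 # clean read passes
assert not load_ok(w ^ (1 << 5), c)  # a single-bit soft error is detected
```

ECC works the same way at access time but spreads multiple check bits (e.g., a Hamming code) over the word so that a single flipped bit can also be located and corrected.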
Characterization of the Protection Overhead
[Figures: component area, power and delay for memory units and pipeline stages, normalized to the highest in each category; core area by cell type]
• Memory, pipeline and arithmetic units vary in area, delay and power
• Depending on the resilience solution, protection overhead is driven by:
  – the number of latches (hardening, parity, ECC)
  – the component size (redundancy)
Characterization of the Protection Overhead
[Figures: protection area breakdown and area of protected components for memory units and pipeline stages, normalized to the highest]
• ECC (encoders/decoders) is too area-expensive for non-memory units; redundancy (the extra copy) is too expensive for all units
• Code encoders/decoders and comparator checkers are significant overhead contributors, relatively large for a small core
• Hardening, parity and function-specific solutions (arithmetic codes) remain the most efficient protection for non-memory components
Characterization of the Protection Overhead
[Figure: area of the protected core, normalized to the original, unprotected core; configurations range from varied coverage with no major performance loss to performance traded for area]
• Most common designs use ECC for memory and redundancy for logic (bar 2)
• Overhead of logic protection can be further reduced by hardening (bar 4), and even more by less robust parity (bar 6)
• Arithmetic codes offer potential for further area optimization (bars 3, 5, 7)
• Ultimately, area can be significantly reduced at the cost of performance with instruction replication
Characterization of the Protection Overhead
• We implemented a range of common protection techniques for comparative analysis in a single core, with a breakdown into components
• We illustrate the sources of overhead, including component size and flip-flop count as well as checker and encoder/decoder size and additional check-bit storage
• We show how more optimal core-level protection is achieved by applying techniques according to their best fit, with the relative gains illustrated
[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection (VLSI-TSA, 2014; in submission, 2015) → Accelerator (FFT) Cross-Layer Protection]
Resilience Optimization with Cross-Layer Techniques
Motivation
– Common solutions for logic circuits are too expensive in area (redundancy); hardening and parity are viable options
– Need to explore alternative architecture- and software-level solutions
  • Can these provide efficient coverage for the relevant circuits?
– Vulnerability and coverage need more accurate testing
Contributions
– Evaluate architecture-level Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT)
– Combine them with hardware and additional software techniques into cross-layer solutions for improved efficiency*
– Present results for two (simple and complex) architectures based on flip-flop-level fault injection*
*In collaboration with Stanford University
Resilience Optimization with Cross-Layer Techniques

Considered Protection Techniques
• Hardware: Hardening/Parity*
• Architecture: Data-Flow Checking (DFC)
• Software: Algorithm-Based Fault Tolerance (ABFT)*

Platform Choice
• Case Study 1: Leon [5]
  – Scalar, In-Order (IO)
  – 7-stage pipeline, SPARC ISA
  – 4kB instruction & data caches
  – Around 1,200 flip-flops
• Case Study 2: Illinois Verilog Model (IVM) [6]
  – Superscalar, Out-of-Order (OoO)
  – 12-stage pipeline, Alpha ISA
  – 8kB instruction & 32kB data caches
  – Around 14,000 flip-flops

Protection Characterization
• Implementation: cell library (hardening)*; RTL implementation (parity, DFC); compiler tool (DFC); software development (ABFT)*
• Analysis: area, power and delay via synthesis; coverage via flip-flop fault injection*; runs in FPGA (Leon) or RTL simulation (IVM)*; PERFECT and SPEC benchmarks
*In collaboration with Stanford University
Resilience Optimization with Cross-Layer Techniques
[Figures: valid transitions between basic blocks in the binary on branch instructions; data-flow check in the Writeback (Leon) or Decode (IVM) stage via signature evaluation]
• Data-flow checking (DFC) protects against errors in the instruction flow that are crucial to program execution and not easily masked
• DFC efficiently compresses a large amount of relevant pipeline state into a signature checked for every basic block (see the sketch below)
• Original hypothesis: DFC improves protection efficiency by reducing the need for complementary solutions (hardening and parity)
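A minimal sketch of the signature idea behind DFC, under assumed details (the names and the plain XOR fold are illustrative, not the dissertation's exact encoding): the compiler precomputes a signature per basic block, and the hardware checker recomputes and compares it at block boundaries.

```python
from functools import reduce

def block_signature(instr_words: list[int]) -> int:
    """Compress a basic block's instruction words into one signature
    (a simple XOR fold here; real designs use stronger compaction)."""
    return reduce(lambda a, b: a ^ b, instr_words, 0)

# Compiler side: embed the expected signature for each basic block.
static_sigs = {0x1000: block_signature([0x8C430000, 0x00622020, 0x14400003])}

def dfc_check(block_pc: int, executed_words: list[int]) -> bool:
    """Checker side: recompute the signature from the state that actually
    flowed through the pipeline and compare against the static version."""
    return block_signature(executed_words) == static_sigs[block_pc]

# An error corrupting any instruction word in the block is flagged at the
# branch that ends the block, triggering recovery.
assert dfc_check(0x1000, [0x8C430000, 0x00622020, 0x14400003])
assert not dfc_check(0x1000, [0x8C430000, 0x00622021, 0x14400003])
```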
Resilience Optimization with Cross-Layer Techniques
Concept
• Uses algorithm properties that relate inputs to outputs to check the result (see the sketch after the application list)
• Some algorithms employ additional encoding/decoding for better detection (FFT)
Applications
• FFT1D Weak
• FFT1D Full
• FFT2D Weak
• FFT2D Full
• 2D Convolution*
• Motion Imaging*
• Wavelet Transform*
• Inner Product*
*In collaboration with Stanford University
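As an illustration of the concept, a minimal checksum-style ABFT sketch for matrix multiplication (a stand-in kernel in the spirit of the inner-product application above; `abft_matmul` and the tolerance are hypothetical, not the dissertation's implementation):

```python
import numpy as np

def abft_matmul(A: np.ndarray, B: np.ndarray, tol: float = 1e-9):
    """Huang/Abraham-style checksums: append a column-sum row to A and a
    row-sum column to B; multiplication preserves the checksum property,
    so a corrupted element of the result is detectable afterward."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    Cf = Ac @ Br
    C = Cf[:-1, :-1]                                    # the actual product
    row_ok = np.allclose(Cf[-1, :-1], C.sum(axis=0), atol=tol)
    col_ok = np.allclose(Cf[:-1, -1], C.sum(axis=1), atol=tol)
    return C, (row_ok and col_ok)

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
C, ok = abft_matmul(A, B)
assert ok   # a soft error in any C element would fail a checksum instead
```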
Resilience Optimization with Cross-Layer Techniques
Flip-Flop-Level Fault Injection*
• Determine vulnerability and detection via flip-flop-level fault injection (see the sketch after this list)
• Rank flip-flops according to vulnerability
• Ranking allows designers to target error reduction via gradual protection of the ranked flip-flops
Hardening, Parity*
• Hardening adds redundancy to flip-flop cells for better immunity
• Parity adds a single check bit to a bit word to verify correctness on access; the implementation is pipelined
[Figure: pipelined parity]
*In collaboration with Stanford University
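A minimal sketch of the fault-injection loop, under assumed interfaces (`run_with_flip` stands in for an RTL simulation run that flips one flip-flop for one cycle; the campaign parameters are hypothetical): each injection is classified, and flip-flops are ranked by how often a flip corrupts the output.

```python
import random
from collections import Counter

def classify(golden_output, output, detected: bool) -> str:
    """Classify one injection: vanished, detected, or silent corruption (SDC)."""
    if output == golden_output:
        return "vanished"
    return "detected" if detected else "sdc"

def injection_campaign(run_with_flip, flip_flops, golden_output,
                       trials_per_ff=100, run_cycles=10_000):
    """Flip each flip-flop at random cycles and rank by vulnerability
    (fraction of injections that change the program output)."""
    stats = {}
    for ff in flip_flops:
        outcomes = Counter()
        for _ in range(trials_per_ff):
            cycle = random.randrange(run_cycles)
            output, detected = run_with_flip(ff, cycle)
            outcomes[classify(golden_output, output, detected)] += 1
        stats[ff] = outcomes
    # Most vulnerable first: protect from the top of this list downward.
    return sorted(stats.items(),
                  key=lambda kv: kv[1]["sdc"] + kv[1]["detected"],
                  reverse=True)
```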
Resilience Optimization with Cross-Layer Techniques
Recovery Options: Pipeline Flush / Recovery Unit (Parity) / Recovery Unit (DFC)
• Pipeline flush takes advantage of the core's built-in flush mechanism, but can recover only if the error is detected before reaching the memory stage
• The parity recovery unit buffers processor state and the register file for the duration of the pipeline length or more
• The DFC recovery unit buffers pipeline state, the register file and memory stores for the (unbounded) basic-block duration
• Recovery adds significant overhead in Leon; IVM can utilize its Reorder Buffer (ROB) instead
Resilience Optimization with Cross-Layer Techniques
DFC Components (a simplified recovery model is sketched below)
• Checker – computes the run-time signature and compares it against the static version
• Recovery Controller – manages buffering and recovery
• Buffer – stores pipeline state and memory requests from the basic block (~15 entries max)
• Shadow register file – restores pipeline state
[Figure: area of adding DFC + recovery; in IVM the overhead is < 1% (~50x larger core, no recovery buffer or shadow register file needed)]
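A minimal sketch of how these components interact (Python pseudocode with hypothetical names; the real design is RTL): stores are held in the buffer until the block's signature checks out, and the shadow register file provides the rollback point.

```python
class DfcRecovery:
    """Buffer state for one basic block; commit on a passing signature
    check, roll back on a failing one (simplified illustrative model)."""

    def __init__(self, regfile: dict, memory: dict):
        self.regfile, self.memory = regfile, memory
        self.shadow = dict(regfile)   # shadow register file: last good state
        self.store_buffer = []        # memory requests held back (~15 entries)

    def buffered_store(self, addr: int, value: int):
        self.store_buffer.append((addr, value))

    def end_of_block(self, signature_ok: bool):
        if signature_ok:
            for addr, value in self.store_buffer:   # drain stores to memory
                self.memory[addr] = value
            self.shadow = dict(self.regfile)        # checkpoint registers
        else:
            self.regfile.clear()
            self.regfile.update(self.shadow)        # roll back registers
        self.store_buffer.clear()                   # replay re-runs the block
```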
Resilience Optimization with Cross-Layer Techniques
Dynamic basic-block structure
• Basic blocks are small (6.3-10.6 instructions) because branch instructions are frequent
• Blocks originating from indirect branches (12.3% by duration) are not protected by DFC
Dynamic signature overhead
• Leon allows utilizing unused delay slots
• Performance overhead: 9.2% (Leon), 10.8% (other ISAs); lower for superscalar cores
• Overhead can be decreased by embedding signatures in unused instruction bits
Resilience Optimization with Cross-Layer Techniques
[Figures: fault-outcome distribution and DFC detection, Leon vs. IVM]
• Most injected faults vanish: 79.3% (Leon), 91.4% (IVM); the remaining faults are meaningful
• Per-component vulnerability and protection vary more for IVM
• DFC provides better detection in IVM (detects 35% of meaningful errors) than in Leon (23%)
• Although coverage per flip-flop is more intermittent than in Leon, IVM has a larger range of flip-flops that affect control flow and are therefore detectable by DFC when injected with a fault
• Average vulnerability (SER) improvement of DFC is still relatively low:

             Leon                        IVM
             SER Improv.   Area Ovh.     SER Improv.   Area Ovh.
  DFC        1.31x         27.3 (3.2)%   1.54x         0.16%
  Har/Par    5.00x         2.6%          5.00x         1.50%
  ABFT       1.05-1.39x    ---           1.12-1.34x    ---
Resilience Optimization with Cross-Layer Techniques
Hardening + Parity: Area
• Overhead of hardening is lowered by adding parity where applicable
• Pipeline flush recovery is needed to save area in Leon
• ROB-based recovery is efficient in IVM
Hardening/Parity + DFC: Area
• DFC increases overhead many-fold due to recovery; not feasible in Leon
• DFC is more efficient at higher SER targets, but the gain is still marginal, within the variation range of synthesis tools
Resilience Optimization with Cross-Layer Techniques
Hardening/Parity + ABFT: Area
• Combining ABFT with hardware techniques decreases the overall overhead
• The benefit is more visible at lower SER targets
Hardening/Parity + DFC + ABFT: Area
• Evaluates the potential benefit of DFC at higher SER targets
• Combining DFC with Har/Par and ABFT provides no benefit; it increases overhead for all SER targets
Resilience Optimization with Cross-Layer Techniques
• We developed Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT) for cross-layer combinations with hardening/parity while considering recovery options
• We show that the intermittent protection of DFC provides coverage below reasonable design targets, thus requiring additional protection and obviating the need for DFC (especially in Leon, due to expensive recovery)
• Hardening/parity proves to be the most efficient generic solution, while combination with ABFT further improves efficiency for particular algorithms
[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection (in preparation, 2015)]
Cross-Layer Protection in an Accelerator
Motivation
• A special-purpose accelerator is expected to be more vulnerable than a general-purpose core due to its high utilization
• How much of the core area (flip-flops) needs to be protected for reasonable error-rate targets?
• An accelerator could potentially benefit from a hardware implementation of Algorithm-Based Fault Tolerance (ABFT)
• Can this approach compete with hardware-level solutions (hardening/parity/redundancy)?
Contributions
• Analyze the FFT accelerator, an interesting case study, with respect to vulnerability and the varying overhead of different designs
• Develop two versions of the hardware ABFT approach (weak and full) and analyze their overhead for different throughput capabilities
• Compare the efficiency of hardening/parity and ABFT (and their cross-layer combination) as well as redundancy
Cross-Layer Protection in an Accelerator

Considered Protection Techniques
• Logic: Hardening/Parity*, Redundancy, Algorithm-Based Fault Tolerance (ABFT)
• Memory: Parity, ECC

Platform Choice
• Case Study: FFT Accelerator [7]
  – Asynchronous processing
  – 1024-point integer input
  – 3 components
  – 2kB input & output arrays
  – Around 500 flip-flops

Protection Characterization
• Implementation: cell library (hardening*); developed RTL code (parity*, ABFT, redundancy)
• Analysis: area, power and delay via synthesis; coverage via flip-flop fault injection; runs in RTL simulation
*In collaboration with Stanford University
Cross-Layer Protection in an Accelerator
[Figure: FFT butterfly flowgraph; hardware units of various capability perform the computation by iterating through the diagram. Source: C. S. Burrus, FFT Flowgraphs, 2009 [8]]
[Figure: common 2x2 unit: processes 2 inputs, involves 1 complex multiplication and 2 complex additions. Source: J. A. Abraham, Fault-tolerant FFT Networks, 1988 [9]]
• The FFT decomposes a time-domain function into a series of sinusoids (see the butterfly sketch below)
• One of the most popular accelerators, with a more complex structure
• Sensitive to errors due to data dependencies and sharing
• Has both strong and weak ABFT algorithms, with a varying range of overheads
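A minimal sketch (Python; illustrative names, not the accelerator's RTL) of the radix-2 butterfly that the 2x2 unit iterates: one complex multiplication by a twiddle factor and two complex additions per input pair.

```python
import cmath

def butterfly(a: complex, b: complex, twiddle: complex) -> tuple[complex, complex]:
    """Radix-2 butterfly: 1 complex multiply + 2 complex adds,
    the datapath of the 2x2 FFT unit."""
    t = b * twiddle
    return a + t, a - t

def fft(x: list[complex]) -> list[complex]:
    """Decimation-in-time FFT built from butterflies
    (length must be a power of 2; recursive form for clarity)."""
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out

# Example: 8-point transform of an impulse yields a flat spectrum.
assert all(abs(v - 1) < 1e-12 for v in fft([1 + 0j] + [0j] * 7))
```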
Cross-Layer Protection in an Accelerator
[Figures: FFT result check comparing the first input point to the sum of the decoded output points; to maintain throughput, full ABFT requires 2 encoders and 2 decoders per 2x2 unit]
• ABFT checks only the end result of the multi-point FFT butterfly computation; recovery is done via replay (see the sketch below)
• The full version requires encoders and decoders for each input to the 2x2 FFT unit; fewer can be used at a performance loss
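A minimal sketch of the end-result check, assuming the plain DFT identity sum_k X[k] = N * x[0] as the "decoding" (the dissertation's encoded variants weight the inputs/outputs for stronger detection; the tolerance is hypothetical):

```python
import numpy as np

def fft_check(x: np.ndarray, X: np.ndarray, tol: float = 1e-6) -> bool:
    """Compare the first input point to the 1/N-decoded sum of output
    points; an error anywhere in the butterfly network perturbs the sum,
    at only O(N) extra additions."""
    n = len(x)
    return abs(X.sum() / n - x[0]) <= tol * max(1.0, abs(x[0]))

x = np.random.rand(1024) + 1j * np.random.rand(1024)
X = np.fft.fft(x)
assert fft_check(x, X)          # clean run passes
X[17] += 1.0                    # simulate a soft error in one output point
assert not fft_check(x, X)      # the check flags it; recover via replay
```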
Cross-Layer Protection in an Accelerator
Flip-Flop-Level Fault Injection
• Similar fault-injection methodology and flip-flop vulnerability ranking as for the cores
• Custom implementation of the fault injector in RTL, tailored to the FFT core
Hardening, Parity
• Hardening adds redundancy to flip-flop cells for better immunity
• Parity adds a single check bit to a bit word to verify correctness on access; the implementation is pipelined
Redundancy
• The most reliable protection: replicates the entire core
• More than doubles the area (redundant copy + checker)
[Figures: pipelined parity; redundancy]
Cross-Layer Protection in an Accelerator
FFT Core
• Core size varies with computation capability (half 2x2, full 2x2, two 2x2) and the required SRAM array (multiport for 2x2 and higher)
• A full-size implementation of the ABFT checker that maintains the FFT core's throughput is larger than redundancy (red line)
[Figures: FFT core + resilience (log scale); half 2x2 and full 2x2 FFT cores, normalized to unprotected]
Cross-Layer Protection in an Accelerator
[Figures: FFT core + checker configurations and their performance impact, measured as the duration of a 1024-point sequence computation. Half 2x2 unit: 4 multipliers, 2-cycle latency, shared enc/dec. Full 2x2 unit: 2x4 multipliers, 1-cycle latency, separate enc/dec]
Cross-Layer Protection in an Accelerator
[Figures: distribution of injected faults, most of which are meaningful (output mismatch, OMM); protected vulnerable flip-flops and detected meaningful errors]
• Most injected faults cause meaningful errors (output mismatches) due to high utilization, unlike in general-purpose cores
• ABFT does not cover the RAM controller, which requires additional protection
• Weak ABFT detects 43% of errors, full ABFT 91%, and redundancy 99.9%
Cross-Layer Protection in an Accelerator
[Figures: half 2x2 core and full 2x2 core. Full ABFT (complemented with hardening/parity) is still inefficient, even with reduced checker area; weak ABFT complemented with hardening/parity is optimal]
• Although its coverage is partial, weak ABFT has low overhead, which allows achieving optimal protection when complemented with hardening/parity
• Despite the reduced checker area, full ABFT still has a large overhead, unless a further major detriment to performance is acceptable (full 2x2 core)
Cross-Layer Protection in an Accelerator
• We implemented ABFT protection for the FFT accelerator to compare against hardening/parity (and their cross-layer combinations) as well as redundancy
• We illustrate the increased vulnerability of the special-purpose FFT unit, which translates into higher hardening/parity overhead and improved ABFT efficiency
• We show that, although its coverage is partial, only weak ABFT provides optimal protection in combination with hardening/parity, due to its low overhead (unlike full ABFT, even with a reduced checker)
Conclusions
• In Chapter 1, we improve the understanding of the relative protection overheads of common resilience solutions and illustrate their efficiency in achieving comprehensive coverage
• In Chapters 2 and 3, we prove the dissertation hypothesis by illustrating the benefits of combining hardening/parity with ABFT through our cross-layer analysis and low-level fault injection
• The low coverage of DFC and the high recovery overhead (in Leon) make DFC uncompetitive against hardening/parity; only the more efficient ABFT can lower the overhead of hardening/parity in an accelerator, which is more sensitive to overhead
References
[1] R. Baumann et al. Soft Errors in Advanced Computer Systems. IEEE Design and Test of Computers, 22(3):258-266, 2005.
[2] N. Seifert et al. Soft Error Susceptibilities of 22 nm Tri-Gate Devices. IEEE Transactions on Nuclear Science, 59:2666-2673, 2012.
[3] N. George et al. Bit-slice Logic Interleaving for Spatial Multi-bit Soft-Error Tolerance. International Conference on Dependable Systems and Networks (DSN), 141-150, 2010.
[4] OpenCores. OpenRISC 1200. http://opencores.org/or1k/Main_Page. 2013.
[5] Aeroflex Gaisler. Leon3. http://www.gaisler.com/index.php/products/processors/leon3?task=view&id=13. 2014.
[6] N. J. Wang et al. Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline. International Conference on Dependable Systems and Networks (DSN), 61-70, 2004.
[7] DigitalFilter. FFT Lab. http://digitalfilter.com/enindex.html. 2014.
[8] C. S. Burrus. FFT Flowgraphs. http://cnx.org/content/m16352/1.11/. 2009.
[9] J. A. Abraham et al. Fault-tolerant FFT Networks. IEEE Transactions on Computers, 548-561, 1988.
[10] Wikipedia. Static Random-Access Memory; Flip-Flop; AND Gate. http://www.wikipedia.org/. 2014.
Publications
• L. G. Szafaryn, E. Cheng, C. Y. Cher, M. Stan, S. Mitra and K. Skadron. "Hardware and Software Algorithm-based Resilience Tradeoffs in Accelerators." In preparation, 2015.
• L. G. Szafaryn, E. Cheng, H. Cho, C. Y. Cher, S. Mirkhani, M. Stan, K. Skadron, S. Mitra and J. Abraham. "Cross-Layer Resilience Exploration." Parts submitted, prepared for another submission, 2015.
• S. Mitra, P. Bose, E. Cheng, C. Y. Cher, H. Cho, R. Joshi, Y. M. Kim, C. R. Lefurgy, Y. Li, K. P. Rodbell, K. Skadron, J. Stathis, and L. G. Szafaryn. "The Resilience Wall: Cross-Layer Solution Strategies." In the International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), April 2014.
• L. G. Szafaryn, B. Meyer and K. Skadron. "Evaluating Overheads of Multibit Soft-Error Protection in the Processor Core." In IEEE Micro, August 2013.
• L. G. Szafaryn, T. Gamblin, B. R. de Supinski and K. Skadron. "Trellis: Portability across Architectures with a High-level Framework." In the Journal of Parallel and Distributed Computing (JPDC), August 2013.
• L. G. Szafaryn, B. Meyer and K. Skadron. "Evaluating Soft Error Protection Mechanisms in the Context of Multi-bit Errors at the Scope of a Processor." At SRC TECHCON, October 2012.
• L. G. Szafaryn, T. Gamblin, B. R. de Supinski and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." In the Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), in conjunction with the 25th International Conference on Supercomputing (ICS), May 2011.
• S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. "Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads." In the IEEE International Symposium on Workload Characterization (IISWC), December 2010.
• M. Guevara, P. Wu, M. D. Marino, J. Meng, L. G. Szafaryn, P. Satyamoorthy, B. Meyer, K. Skadron, J. Lach and B. Calhoun. "Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores." At SRC TECHCON, September 2010.
• L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC), in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009.
• L. G. Szafaryn, J. Saucerman, S. Acton, and K. Skadron. "A Tale of Two Systems Biology Applications: Experiences Accelerating Ultrasound Feature Tracking vs. Multi-Timescale Simulations." University of Virginia Dept. of Computer Science Technical Report CS-2009-01, March 2009.
Questions