

Understanding and Optimizing Heterogeneous Soft-Error Protection

Ph.D. Defense Presentation by Lukasz G. Szafaryn

Advisor: Prof. Kevin Skadron

April 14th, 2015

Dissertation Outline

The Soft-Error Problem

Dissertation Scope

Chapter 1

Characterization of the Protection Overhead

Chapter 2

Resilience Optimization with Cross-Layer Techniques

Chapter 3

Cross Layer Protection in an Accelerator

Dissertation Summary


Soft-Error Problem

What are Soft Errors?

Sources
• Cosmic rays
• Alpha particles

Vulnerability
• Most errors in sequential logic (flip-flops, SRAM)
• Fewer errors in combinational logic (gates)

[Figure: particle strike and the resulting current pulse. Source: R. Baumann et al., IEEE DTC, 2005 [1]]

Masking
• Electrical, temporal
• Logical
• Architectural
• OS, software

• Soft errors are caused by particle strikes and matter most in storage cells
• Most errors are masked across the hardware-architecture-software hierarchy, but many remain


Soft-Error Problem

What affects the soft-error rate?

Decrease
– Multi-gate transistors and silicon-on-insulator technologies (smaller sensitive area)

Increase
– Lower voltage, smaller critical charge
– More multi-bit errors
– Higher altitude
– Larger process variation
– Growing chip and computation complexity

[Figure: decreasing cell vulnerability. Source: N. Seifert et al., IEEE ToNS, 2012 [2]]
[Figure: particle strike causing multi-bit errors. Source: N. George et al., DSN, 2010 [3]]


Soft-Error Problem

Existing Resilience Approaches

Specific Circuit/Component-Level Protection (combinational, sequential, SRAM)
[Figure: gate, flip-flop, and SRAM cell schematics. Source: Wikipedia, 2014 [10]]
– Redundancy*^
– Arithmetic Codes (ALU/FPU)
– Parity*, ECC*^
– Upsizing, Hardening

General Core-Level Protection
– Instruction Replication^ / Redundant Multi-Threading (RMT)^
– Control/Data-Flow Checking (CFC/DFC)
– Algorithm-Based Fault Tolerance (ABFT)
– Other SW and HW techniques, less efficient

* - commonly used
^ - high overhead
bold – analyzed in this work


Dissertation Scope

Problem:

Common resilience approaches based on component redundancy or workload replication/restore incur too much overhead

Hypothesis:

Achieving better coverage at lower cost will require analysis of other, cross-layer solutions and their combinations

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]


Chapter 1

Characterization of the Protection Overhead

IEEE Micro, 2013

[Roadmap: Existing Protection Overhead Characterization (this chapter) → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]


Characterization of the Protection Overhead

Overview

Motivation

– High overhead of redundancy, the most comprehensive solution

• How do other techniques scale for better coverage?

– Important to understand the overhead breakdown

• Determine efficiency of common techniques

• Illustrate optimal designs currently possible

• Serve as baseline to facilitate further exploration

Contributions

– Develop resilience techniques for an example core and place them in an exploration framework

– Provide area, power, and delay overhead analysis at the component and core levels

– No previous work provides such a comprehensive resilience analysis within a single core


Characterization of the Protection Overhead

Methodology

Considered Protection Techniques
More common: Parity, ECC, Redundancy
Less common: Hardening, Arithmetic Codes (parity prediction, residue codes), Instruction Replication

Platform Choice
Case Study: OpenRISC [4]
– Throughput and mobile designs
– Scalar, in-order (IO)
– 5-stage pipeline
– 4 kB instruction & data caches
– Around 1,300 flip-flops

Protection Characterization
Implementation: developed RTL code; literature estimates (hardening)
Analysis: area, power, and delay via synthesis; integrated into the Svalinn framework; estimation for other cores


Characterization of the Protection Overhead

Protection Techniques

Hardening – improves flip-flop immunity by adding redundant transistors that hold the state on an upset or cancel its effect

Codes (parity, ECC, arithmetic)
Parity – adds a single check bit to a bit word to verify correctness on access
ECC – adds multiple check bits to a bit word to detect two errors and correct one
Arithmetic – parity-prediction and residue codes that predict output properties from the inputs

Redundancy – replicates a component and compares results from the two copies to check for errors
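To make the parity definition above concrete, here is a minimal Python sketch (with hypothetical helper names) of how a single even-parity check bit catches an odd number of bit flips on access; ECC extends the same idea with multiple check bits to also locate and correct a flipped bit:

```python
def parity_bit(word: int, width: int = 32) -> int:
    """Even-parity check bit: 1 if the word holds an odd number of 1s."""
    return bin(word & ((1 << width) - 1)).count("1") & 1

def check_on_access(word: int, stored_parity: int) -> bool:
    """True if no error is detected (any odd number of flips is caught)."""
    return parity_bit(word) == stored_parity

data = 0b1011_0010
p = parity_bit(data)                      # check bit stored alongside the word
corrupted = data ^ (1 << 5)               # a particle strike flips bit 5
assert check_on_access(data, p)           # clean word passes
assert not check_on_access(corrupted, p)  # single-bit upset detected
```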


Characterization of the Protection Overhead

Case Study: OpenRISC Core

[Figures: component area, power, and delay for memory units and pipeline stages (normalized to the highest in each category); core area by cell type]

• Memory, pipeline, and arithmetic units vary in area, delay, and power

• Depending on the resilience solution, protection overhead is affected by:
– the number of latches (hardening, parity, ECC)
– component size (redundancy)


Characterization of the Protection Overhead

Component-Level Area

[Figures: protection area breakdown and area of protected components, for memory units and pipeline stages (normalized to the highest)]

• ECC (encoders/decoders) is too area-expensive for non-memory units; redundancy (full copy) is too expensive for all units

• Code encoders/decoders and comparator checkers are significant overhead contributors, relatively large for a small core

• Hardening, parity, and function-specific solutions (arithmetic codes) remain the most efficient protection for non-memory components


Characterization of the Protection Overhead

Core-Level Area

[Figure: area of protected core configurations, normalized to the original, unprotected core; some designs vary coverage with no major performance loss, others trade performance for area]

• Most common designs use ECC for memory and redundancy for logic (bar 2)

• The overhead of logic protection can be further reduced by hardening (bar 4), and even more by less robust parity (bar 6)

• Arithmetic codes provide potential for further area optimization (bars 3, 5, 7)

• Ultimately, area can be significantly reduced at the cost of performance with instruction replication


Characterization of the Protection Overhead

Chapter 1 Summary

• We implemented a range of common protection techniques for comparative analysis in a single core, with a breakdown into components

• We illustrate the sources of overhead, including component size, number of flip-flops, checker and encoder/decoder size, and additional check-bit storage

• We show how more optimal core-level protection is achieved by applying techniques according to their best fit, and illustrate the relative gains


Chapter 2

Resilience Optimization with Cross-Layer Techniques

VLSI-TSA, 2014; in submission, 2015

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection (this chapter) → Accelerator (FFT) Cross-Layer Protection]


Resilience Optimization with Cross-Layer Techniques

Overview

Motivation

– Common solutions for logic circuits (redundancy) are too expensive in area; hardening and parity are viable options

– Need to explore alternative architecture- and software-level solutions

• Can these provide efficient coverage for relevant circuits?

– Vulnerability and coverage need more accurate testing

Contributions

– Evaluate architecture-level Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT)

– Combine them with hardware and software techniques into cross-layer solutions for improved efficiency*

– Present results for two (simple and complex) architectures based on flip-flop-level fault injection*

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Methodology

Considered Protection Techniques
Hardware: Hardening/Parity*
Architecture: Data-Flow Checking (DFC)
Software: Algorithm-Based Fault Tolerance (ABFT)*

Platform Choice

Case Study 1: Leon [5]
– Scalar, in-order (IO)
– 7-stage pipeline, SPARC ISA
– 4 kB instruction & data caches
– Around 1,200 flip-flops

Case Study 2: Illinois Verilog Model (IVM) [6]
– Superscalar, out-of-order (OoO)
– 12-stage pipeline, Alpha ISA
– 8 kB instruction & 32 kB data caches
– Around 14,000 flip-flops

Protection Characterization
Implementation: cell library (hardening)*; implemented RTL (parity, DFC); created a compiler tool (DFC); developed software (ABFT)*
Analysis: area, power, and delay via synthesis; coverage via flip-flop fault injection*; runs on FPGA (Leon) or in RTL simulation (IVM)*; PERFECT and SPEC benchmarks

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Why Data-Flow Checking (DFC)?

[Figures: valid transitions between basic blocks in the binary on branch instructions; data-flow check in the Writeback (Leon) or Decode (IVM) stage via signature evaluation]

• DFC protects against errors in instruction flow, which are crucial to program execution and not easily masked

• DFC efficiently compresses a large amount of relevant pipeline state into a signature that is checked for every basic block (see the sketch below)

• Original hypothesis: DFC improves the efficiency of protection by reducing the need for complementary solutions (hardening and parity)
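As a software-level illustration only, here is a minimal Python sketch of signature-based control-flow checking in the spirit of DFC; the block names, CRC-based signatures, and control-flow graph are hypothetical stand-ins for what the compiler tool would derive from a real binary:

```python
# Minimal model of signature checking at basic-block boundaries.
import zlib

# "Static analysis": assign each basic block a signature and record the
# legal successors of each block (the control-flow graph).
static_sig = {bb: zlib.crc32(bb.encode()) for bb in ["entry", "loop", "exit"]}
legal_succ = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}

def check_transition(prev_bb: str, next_bb: str, runtime_sig: int) -> None:
    """At each block boundary, verify the accumulated run-time signature
    and that the taken transition exists in the static CFG."""
    if runtime_sig != static_sig[next_bb]:
        raise RuntimeError(f"signature mismatch entering {next_bb}")
    if next_bb not in legal_succ[prev_bb]:
        raise RuntimeError(f"illegal transition {prev_bb} -> {next_bb}")

# Normal execution passes; a fault that diverts entry -> exit is caught.
check_transition("entry", "loop", static_sig["loop"])
try:
    check_transition("entry", "exit", static_sig["exit"])
except RuntimeError as e:
    print("DFC detected:", e)
```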


Resilience Optimization with Cross-Layer Techniques

Algorithm-Based Fault Tolerance (ABFT)

Concept

• Uses algorithm properties that relate inputs to outputs to check the result (see the checksum sketch after the applications list)

• Some algorithms employ additional encoding/decoding for better detection (FFT)

Applications

• FFT1D Weak

• FFT1D Full

• FFT2D Weak

• FFT2D Full

• 2D Convolution*

• Motion Imaging*

• Wavelet Transform*

• Inner Product*

*In collaboration with Stanford University
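To show the kind of input/output property ABFT exploits, here is a minimal Python sketch of a checksum check for convolution (listed above as 2D convolution; shown in 1D for brevity, since the same identity holds): for a full convolution y = x * h, the sum of the outputs equals the product of the input sums. The function name and tolerance are illustrative assumptions:

```python
import numpy as np

def convolve_with_abft(x: np.ndarray, h: np.ndarray, tol: float = 1e-9):
    """Full convolution plus an ABFT checksum test:
    for y = x * h (full), sum(y) == sum(x) * sum(h)."""
    y = np.convolve(x, h, mode="full")
    if not np.isclose(y.sum(), x.sum() * h.sum(), rtol=tol):
        raise RuntimeError("ABFT checksum mismatch: recompute")
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, -1.0, 0.25])
y = convolve_with_abft(x, h)         # clean run passes

y_bad = y.copy()
y_bad[2] += 1e-3                     # simulated soft error in one output
assert not np.isclose(y_bad.sum(), x.sum() * h.sum())
```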


Resilience Optimization with Cross-Layer Techniques

Injection and Other Resilience Techniques

Flip-flop-Level Fault Injection*

• Determine vulnerability and detection via flip-flop-level fault injection

• Rank flip-flops according to vulnerability

• Allows designers to target error reduction via gradual protection of ranked flip-flops (see the sketch below)
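A toy Python stand-in for the injection campaign described above; the "model" and its four flip-flops are invented for illustration, but the flow is the same: flip one flip-flop bit at a random cycle, compare against a golden run, count unmasked (meaningful) errors, and rank flip-flops by vulnerability:

```python
import random

def run_model(inputs, inject=None):
    """Toy stand-in for an RTL simulation: four 'flip-flops' feed a serial
    computation. `inject` = (flip_flop_index, cycle) flips one bit for the
    rest of the run, mimicking a transient upset."""
    ffs = list(inputs)
    out = 0
    for cycle in range(8):
        if inject is not None and inject[1] == cycle:
            ffs[inject[0]] ^= 1                 # the particle strike
        out = (out << 1 | ffs[cycle % len(ffs)]) & 0xFF
    return out

golden = run_model([1, 0, 1, 1])
trials, mismatches = 200, {ff: 0 for ff in range(4)}
for ff in mismatches:
    for _ in range(trials):
        if run_model([1, 0, 1, 1], inject=(ff, random.randrange(8))) != golden:
            mismatches[ff] += 1                 # meaningful error, not masked

# Rank flip-flops by observed vulnerability; protect from the top down.
ranking = sorted(mismatches, key=mismatches.get, reverse=True)
print(ranking, mismatches)
```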

Hardening, Parity*

Hardening adds redundancy to flip-flop cells for better immunity

Parity adds a single check bit to a bit word to verify correctness on access, and is pipelined

[Figure: pipelined parity]

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Ultimate Consideration: Local Recovery

Pipeline Flush
• Takes advantage of the built-in pipeline flush mechanism in the core
• Can recover only if the error is detected before reaching the memory stage

Recovery Unit (Parity)
• Buffers processor state and the register file for the duration of the pipeline length or more

Recovery Unit (DFC)
• Buffers pipeline state, register file, and memory stores for the (unbounded) basic-block duration

Significant overhead in Leon; IVM can utilize its Reorder Buffer (ROB) instead


Resilience Optimization with Cross-Layer Techniques

DFC Area

DFC Components
– Checker: computes the run-time signature and compares it against the static version
– Recovery controller: manages buffering and recovery
– Buffer: stores pipeline state and memory requests from a basic block (~15 entries max)
– Shadow register file: restores pipeline state

[Figures: Leon with DFC + recovery added; IVM with DFC + recovery added, overhead < 1% (~50x larger core, no recovery buffer or shadow register file needed)]


Resilience Optimization with Cross-Layer Techniques

DFC Performance

Dynamic basic-block structure

• Basic blocks are small because branch instructions are frequent

• Basic-block size: 6.3-10.6 instructions

• Blocks originating from indirect branches (12.3% by duration) are not protected by DFC

Dynamic signature overhead

• Leon allows utilizing unused delay slots

• Performance overhead: 9.2% (Leon), 10.8% (other ISAs); lower for superscalar cores

• Overhead can be decreased by embedding signatures in unused instruction bits


Resilience Optimization with Cross-Layer Techniques

DFC Coverage

• Most injected faults vanish: 79.3% (Leon), 91.4% (IVM); the remaining ones are meaningful

• Per-component vulnerability and protection vary more for IVM

• DFC provides better detection in IVM (detects 35% of meaningful errors) than in Leon (23%)

• Although detection per flip-flop is more intermittent than in Leon, IVM has a larger set of flip-flops that affect control flow when injected with a fault, and these are detectable by DFC

• The average vulnerability (SER) improvement of DFC is still relatively low:

                   Leon                          IVM
                   SER Improv.   Area Overhead   SER Improv.   Area Overhead
DFC                1.31x         27.3 (3.2)%     1.54x         0.16%
Hardening/Parity   5.00x         2.6%            5.00x         1.50%
ABFT               1.05-1.39x    ---             1.12-1.34x    ---


Resilience Optimization with Cross-Layer Techniques

Cross-Layer: Hardware-Architecture

Hardening + Parity: Area

• Overhead of hardening is lowered by adding parity where applicable

• Pipeline-flush recovery is needed to save area in Leon

• Efficient ROB recovery in IVM

Hardening/Parity + DFC: Area

• DFC increases overhead many-fold due to recovery; not feasible in Leon

• DFC is more efficient at higher SER targets

• The benefit is still marginal, within the variation range of synthesis tools


Resilience Optimization with Cross-Layer Techniques

Cross-Layer: Hardware-Architecture-Software

Hardening/Parity + ABFT: Area

• Combining ABFT with hardware techniques decreases overall overhead

• The benefit is more visible at lower SER targets

Hardening/Parity + DFC + ABFT: Area

• Evaluates the potential benefit of DFC at higher SER targets

• Combining DFC with hardening/parity and ABFT provides no benefit

• It increases overhead for all SER targets


Resilience Optimization with Cross-Layer Techniques

Chapter 2 Summary

• We developed Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT) for cross-layer combinations with hardening/parity, while considering recovery options

• We show that the intermittent protection of DFC provides coverage below reasonable design targets, thus requiring additional protection and obviating the need for DFC (especially in Leon, due to expensive recovery)

• Hardening/parity proves to be the most efficient generic solution, while combining it with ABFT further improves efficiency for particular algorithms


Chapter 3

Cross-Layer Protection in an Accelerator

In preparation, 2015

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection (this chapter)]


Cross-Layer Protection in an Accelerator

Overview

Motivation

• A special-purpose accelerator is expected to be more vulnerable than a general-purpose core due to high utilization

• How much of the core area (flip-flops) needs to be protected for reasonable error-rate targets?

• An accelerator could potentially benefit from a hardware implementation of Algorithm-Based Fault Tolerance (ABFT)

• Can this approach compete with hardware-level solutions (hardening/parity/redundancy)?

Contributions

• Analyze the FFT accelerator, an interesting case study, with respect to vulnerability and the varying overhead of different designs

• Develop two versions of the hardware ABFT approach (weak and full) and analyze their overhead for different throughput capabilities

• Compare the efficiency of hardening/parity and ABFT (and their cross-layer combination) as well as redundancy


Cross-Layer Protection in an Accelerator

Methodology

Considered Protection Techniques
Logic: Hardening/Parity*, Redundancy, Algorithm-Based Fault Tolerance (ABFT)
Memory: Parity, ECC

Platform Choice
Case Study: FFT Accelerator [7]
– Asynchronous processing
– 1024-bit integer input
– 3 components
– 2 kB input & output arrays
– Around 500 flip-flops

Protection Characterization
Implementation: cell library (hardening*); developed RTL code (parity*, ABFT, redundancy)
Analysis: area, power, and delay via synthesis; coverage via flip-flop fault injection; runs in RTL simulation

*In collaboration with Stanford University


Cross-Layer Protection in an Accelerator

Why a Fast Fourier Transform (FFT) Accelerator?

[Figure: hardware units of various capability perform the computation by iterating through the butterfly diagram. Source: C. S. Burrus, FFT Flowgraphs, 2009 [8]]
[Figure: common 2x2 unit: processes 2 inputs with 1 complex multiplication and 2 complex additions. Source: J. A. Abraham, Fault-tolerant FFT networks, 1988 [9]]

• The FFT decomposes a time-domain function into a series of sinusoids

• One of the most popular accelerators, with a relatively complex structure

• Sensitive to errors due to data dependencies and sharing

• Has strong and weak ABFT algorithms, with a varying range of overheads (butterfly sketch below)
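To make the 2x2 unit's work concrete, here is a minimal Python sketch of the radix-2 butterfly it iterates over: one complex multiplication by the twiddle factor and two complex additions, matching the operation count cited above:

```python
import cmath

def butterfly(a: complex, b: complex, k: int, n: int):
    """Decimation-in-time radix-2 butterfly with twiddle factor W_n^k."""
    w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
    t = w * b                              # 1 complex multiplication
    return a + t, a - t                    # 2 complex additions

# With k=0, n=2 this is exactly the 2-point FFT of [1, 2]: (3, -1).
print(butterfly(1 + 0j, 2 + 0j, k=0, n=2))
```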


Cross-Layer Protection in an Accelerator

Algorithm-Based Fault Tolerance (ABFT)

[Figures: FFT result check comparing the first input point to the sum of the decoded output points; to maintain throughput, full ABFT requires 2 encoders and 2 decoders per 2x2 unit]

• ABFT checks only the end result of the multiple-point FFT butterfly computation; recovery is done via replay (see the sketch below)

• The full version requires encoders and decoders for each input to the 2x2 FFT unit; fewer can be used at a performance loss
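A minimal Python sketch of the weak check described above, relating the first input point to the sum of the output points: for the DFT, sum_k X[k] = N * x[0], so a corrupted output is flagged and the FFT replayed. The function name and tolerance are illustrative assumptions, not the hardware checker itself:

```python
import numpy as np

def fft_with_weak_abft(x: np.ndarray, tol: float = 1e-9) -> np.ndarray:
    """Weak ABFT check for the DFT: sum_k X[k] == N * x[0], i.e., the
    first input point is compared against the sum of the output points."""
    X = np.fft.fft(x)
    if not np.isclose(X.sum(), len(x) * x[0], atol=tol):
        raise RuntimeError("ABFT mismatch: replay the FFT")
    return X

x = np.array([3.0, 1.0, -2.0, 0.5])
X = fft_with_weak_abft(x)                # clean run passes

X[1] += 0.01                             # simulated soft error in one output
assert not np.isclose(X.sum(), len(x) * x[0])
```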


Cross-Layer Protection in an Accelerator

Injection and Other Resilience Techniques

Flip-flop-Level Fault Injection

• Similar fault-injection methodology and flip-flop vulnerability ranking as in Chapter 2

• Custom implementation of the fault injector in RTL, tailored to the FFT core

Hardening, Parity

Hardening adds redundancy to flip-flop cells for better immunity

Parity adds a single check bit to a bit word to verify correctness on access, and is pipelined

Redundancy

• The most reliable protection: replicates the entire core (sketched below)

• More than doubles the area (redundant copy + checker)

[Figures: pipelined parity; redundancy]
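A conceptual Python sketch of the redundancy scheme (dual modular redundancy with a comparator); in hardware the two copies run in lockstep on the same inputs, which this software model only approximates, since a deterministic function needs an injected fault to ever diverge:

```python
def dmr(compute, *args):
    """Run two copies of a unit and compare outputs; detection only, so a
    mismatch triggers replay (correction would need a third copy or rollback)."""
    a, b = compute(*args), compute(*args)
    if a != b:
        raise RuntimeError("redundancy checker: mismatch, replaying")
    return a

# Clean run: both copies agree and the result passes through the checker.
print(dmr(lambda x, y: x * y + 1, 6, 7))  # 43
```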


Cross-Layer Protection in an Accelerator

Resilience Area

FFT Core

• FFT core size varies with computation capability (half 2x2, full 2x2, two 2x2) and the required SRAM array (multiport for 2x2 and higher)

• A full-size implementation of the ABFT checker that maintains the throughput of the FFT core is larger than redundancy (red line)

[Figures: FFT core + resilience area (log scale); half 2x2 and full 2x2 FFT cores, normalized to unprotected]


Cross-Layer Protection in an Accelerator

ABFT Resilience Performance

FFT Core + Checker Configurations
– Half 2x2 unit (4 multipliers, 2-cycle latency, shared enc/dec)
– Full 2x2 unit (2x4 multipliers, 1-cycle latency, separate enc/dec)

[Figure: performance impact, duration of the 1024-bit sequence computation]


Cross-Layer Protection in an Accelerator

Individual Resilience Coverage

[Figures: distribution of injected faults, most of which are meaningful (output mismatch, OMM); protected vulnerable flip-flops and detected meaningful errors]

• Most injected faults cause meaningful errors (output mismatch) due to high utilization, unlike in general-purpose cores

• ABFT does not cover the RAM controller, which requires additional protection

• Weak ABFT detects 43% of errors, full ABFT 91%, and redundancy 99.9%


Cross-Layer Protection in an Accelerator

Cross-Layer Resilience Coverage

[Figures: half 2x2 and full 2x2 cores; full ABFT complemented with hardening/parity is still inefficient, even with reduced checker area; weak ABFT complemented with hardening/parity is optimal]

• Although its coverage is partial, weak ABFT has low overhead, which allows achieving optimal protection when complemented with hardening/parity

• Despite the reduced checker area in full ABFT, it still has a large overhead, unless a further major detriment to performance is acceptable (full 2x2 core)


Cross-Layer Protection in an Accelerator

Chapter 3 Summary

• We implemented ABFT protection for the FFT accelerator to compare against hardening/parity (and their cross-layer combinations) as well as redundancy

• We illustrate the increased vulnerability of the special-purpose FFT unit, which translates to a higher overhead for hardening/parity and improved ABFT efficiency

• We show that, although its coverage is partial, only weak ABFT provides optimal protection in combination with hardening/parity, due to its low overhead (unlike full ABFT, even with a reduced checker)


Dissertation Summary

In Chapter 1, we improve the understanding of the relative protection overheads of common resilience solutions and illustrate their efficiency in achieving comprehensive coverage

In Chapters 2 and 3, we prove the dissertation hypothesis by illustrating the benefits of combining hardening/parity with ABFT, using our cross-layer analysis and low-level fault injection

• The low coverage of DFC and its high recovery overhead (in Leon) make DFC uncompetitive against hardening/parity. Only the more efficient ABFT can lower the overhead of hardening/parity in an accelerator, which is more sensitive to overhead.


References

[1] R. Baumann et al. Soft Errors in Advanced Computer Systems. IEEE Design and Test of Computers, 22(3):258-266, 2005.
[2] N. Seifert et al. Soft Error Susceptibilities of 22 nm Tri-Gate Devices. IEEE Transactions on Nuclear Science, 59:2666-2673, 2012.
[3] N. George et al. Bit-slice Logic Interleaving for Spatial Multi-bit Soft-Error Tolerance. International Conference on Dependable Systems and Networks (DSN), 141-150, 2010.
[4] OpenCores. OpenRISC 1200. http://opencores.org/or1k/Main_Page. 2013.
[5] Aeroflex Gaisler. Leon3. http://www.gaisler.com/index.php/products/processors/leon3?task=view&id=13. 2014.
[6] N. J. Wang et al. Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline. International Conference on Dependable Systems and Networks (DSN), 61-70, 2004.
[7] DigitalFilter. FFT Lab. http://digitalfilter.com/enindex.html. 2014.
[8] C. S. Burrus. FFT Flowgraphs. http://cnx.org/content/m16352/1.11/. 2009.
[9] J. A. Abraham et al. Fault-tolerant FFT Networks. IEEE Transactions on Computers, 548-561, 1988.
[10] Wikipedia. Static Random-Access Memory; Flip-Flop; AND Gate. http://www.wikipedia.org/. 2014.


Publications

• L. G. Szafaryn, E. Cheng, C. Y. Cher, M. Stan, S. Mitra, and K. Skadron. "Hardware and Software Algorithm-based Resilience Tradeoffs in Accelerators." In preparation, 2015.

• L. G. Szafaryn, E. Cheng, H. Cho, C. Y. Cher, S. Mirkhani, M. Stan, K. Skadron, S. Mitra, and J. Abraham. "Cross-Layer Resilience Exploration." Parts submitted, prepared for another submission, 2015.

• S. Mitra, P. Bose, E. Cheng, C. Y. Cher, H. Cho, R. Joshi, Y. M. Kim, C. R. Lefurgy, Y. Li, K. P. Rodbell, K. Skadron, J. Stathis, and L. G. Szafaryn. "The Resilience Wall: Cross-Layer Solution Strategies." In the International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), April 2014.

• L. G. Szafaryn, B. Meyer, and K. Skadron. "Evaluating Overheads of Multibit Soft-Error Protection in the Processor Core." In IEEE Micro, August 2013.

• L. G. Szafaryn, T. Gamblin, B. R. de Supinski, and K. Skadron. "Trellis: Portability across Architectures with a High-level Framework." In the Journal of Parallel and Distributed Computing (JPDC), August 2013.

• L. G. Szafaryn, B. Meyer, and K. Skadron. "Evaluating Soft Error Protection Mechanisms in the Context of Multi-bit Errors at the Scope of a Processor." At SRC TECHCON, October 2012.

• L. G. Szafaryn, T. Gamblin, B. R. de Supinski, and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." In the Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), in conjunction with the 25th International Conference on Supercomputing (ICS), May 2011.

• S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. "Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads." In the IEEE International Symposium on Workload Characterization (IISWC), December 2010.

• M. Guevara, P. Wu, M. D. Marino, J. Meng, L. G. Szafaryn, P. Satyamoorthy, B. Meyer, K. Skadron, J. Lach, and B. Calhoun. "Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores." At SRC TECHCON, September 2010.

• L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC), in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009.

• L. G. Szafaryn, J. Saucerman, S. Acton, and K. Skadron. "A Tale of Two Systems Biology Applications: Experiences Accelerating Ultrasound Feature Tracking vs. Multi-Timescale Simulations." University of Virginia Dept. of Computer Science Technical Report CS-2009-01, March 2009.


Questions
