

Understanding and Optimizing Heterogeneous Soft-Error Protection

Ph.D. Defense Presentation by Lukasz G. Szafaryn

Advisor: Prof. Kevin Skadron

April 14th, 2015

Dissertation Outline

The Soft-Error Problem

Dissertation Scope

Chapter 1

Characterization of the Protection Overhead

Chapter 2

Resilience Optimization with Cross-Layer Techniques

Chapter 3

Cross Layer Protection in an Accelerator

Dissertation Summary


Soft-Error Problem

What are Soft Errors?

Sources
• Cosmic rays
• Alpha particles

Vulnerability
• Most errors in sequential logic (flip-flops, SRAM)
• Fewer errors in combinational logic (gates)

[Figure: particle strike and the resulting current pulse. Source: R. Baumann et al., IEEE DTC, 2005 [1]]

Masking
• Electrical, temporal
• Logical
• Architectural
• OS, software

• Soft errors are caused by particle strikes and matter most in storage cells
• Most errors are masked across the hardware-architecture-software hierarchy, but many remain


Soft-Error Problem

What affects the soft-error rate?

Decrease
– Multi-gate transistors and silicon-on-insulator technologies (smaller sensitive area)

Increase
– Lower voltage, smaller critical charge
– More multi-bit errors
– Higher altitude
– Larger process variation
– Growing chip and computation complexity

[Figure: decreasing cell vulnerability. Source: N. Seifert et al., IEEE ToNS, 2012 [2]]
[Figure: particle strike causing multi-bit errors. Source: N. George et al., DSN, 2010 [3]]


Soft-Error Problem

Existing Resilience Approaches

Specific Circuit/Component-Level Protection (combinational, sequential, SRAM)
[Figure: gate, flip-flop, and SRAM cell schematics. Source: Wikipedia, 2014 [10]]
– Redundancy*^
– Arithmetic Codes (ALU/FPU)
– Parity*, ECC*^
– Upsizing, Hardening

General Core-Level Protection
– Instruction Replication^ / Redundant Multi-Threading (RMT)^
– Control/Data-Flow Checking (CFC/DFC)
– Algorithm-Based Fault Tolerance (ABFT)
– Other SW and HW techniques, less efficient

* - commonly used
^ - high overhead
bold – analyzed in this work


Dissertation Scope

Problem:

Common resilience approaches based on component redundancy or workload replication/restore incur too much overhead

Hypothesis:

Achieving better coverage at lower cost will require analysis of other, cross-layer solutions and their combinations

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]


Chapter 1

Characterization of the Protection Overhead

IEEE Micro, 2013

[Roadmap: Existing Protection Overhead Characterization (this chapter) → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection]


Characterization of the Protection Overhead

Overview

Motivation

– High overhead of redundancy, the most comprehensive solution

• How do other techniques scale for better coverage?

– Important to understand the overhead breakdown

• Determine efficiency of common techniques

• Illustrate optimal designs currently possible

• Serve as baseline to facilitate further exploration

Contributions

– Develop resilience techniques for an example core and place them in an exploration framework

– Provide area, power, and delay overhead analysis at the component and core levels

– No previous work provides such a comprehensive resilience analysis within a single core


Characterization of the Protection Overhead

Methodology

Considered Protection Techniques
More common: Parity, ECC, Redundancy
Less common: Hardening, Arithmetic Codes (parity prediction, residue codes), Instruction Replication

Platform Choice
Case Study: OpenRISC [4]
– Throughput and mobile designs
– Scalar, in-order (IO)
– 5-stage pipeline
– 4 kB instruction & data caches
– Around 1,300 flip-flops

Protection Characterization
Implementation: developed RTL code; literature estimates (hardening)
Analysis: area, power, and delay via synthesis; integrated into the Svalinn framework; estimation for other cores


Characterization of the Protection Overhead

Protection Techniques

Hardening – improves flip-flop immunity by adding redundant transistors that hold the state on an upset or cancel its effect

Codes (parity, ECC, arithmetic)
Parity – adds a single check bit to a bit word to verify correctness on access
ECC – adds multiple check bits to a bit word to detect two errors and correct one
Arithmetic – parity-prediction and residue codes that predict output properties from the inputs

Redundancy – replicates a component and compares results from the two copies to check for errors
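To make the parity definition above concrete, here is a minimal Python sketch (with hypothetical helper names) of how a single even-parity check bit catches an odd number of bit flips on access; ECC extends the same idea with multiple check bits to also locate and correct a flipped bit:

```python
def parity_bit(word: int, width: int = 32) -> int:
    """Even-parity check bit: 1 if the word holds an odd number of 1s."""
    return bin(word & ((1 << width) - 1)).count("1") & 1

def check_on_access(word: int, stored_parity: int) -> bool:
    """True if no error is detected (any odd number of flips is caught)."""
    return parity_bit(word) == stored_parity

data = 0b1011_0010
p = parity_bit(data)                      # check bit stored alongside the word
corrupted = data ^ (1 << 5)               # a particle strike flips bit 5
assert check_on_access(data, p)           # clean word passes
assert not check_on_access(corrupted, p)  # single-bit upset detected
```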


Characterization of the Protection Overhead

Case Study: OpenRISC Core

[Figures: component area, power, and delay for memory units and pipeline stages (normalized to the highest in each category); core area by cell type]

• Memory, pipeline, and arithmetic units vary in area, delay, and power

• Depending on the resilience solution, protection overhead is affected by:
– the number of latches (hardening, parity, ECC)
– component size (redundancy)


Characterization of the Protection Overhead

Component-Level Area

[Figures: protection area breakdown and area of protected components, for memory units and pipeline stages (normalized to the highest)]

• ECC (encoders/decoders) is too area-expensive for non-memory units; redundancy (full copy) is too expensive for all units

• Code encoders/decoders and comparator checkers are significant overhead contributors, relatively large for a small core

• Hardening, parity, and function-specific solutions (arithmetic codes) remain the most efficient protection for non-memory components


Characterization of the Protection Overhead

Core-Level Area

[Figure: area of protected core configurations, normalized to the original, unprotected core; some designs vary coverage with no major performance loss, others trade performance for area]

• Most common designs use ECC for memory and redundancy for logic (bar 2)

• The overhead of logic protection can be further reduced by hardening (bar 4), and even more by less robust parity (bar 6)

• Arithmetic codes provide potential for further area optimization (bars 3, 5, 7)

• Ultimately, area can be significantly reduced at the cost of performance with instruction replication


Characterization of the Protection Overhead

Chapter 1 Summary

• We implemented a range of common protection techniques for comparative analysis in a single core, with a breakdown into components

• We illustrate the sources of overhead, including component size, number of flip-flops, checker and encoder/decoder size, and additional check-bit storage

• We show how more optimal core-level protection is achieved by applying techniques according to their best fit, and illustrate the relative gains


Chapter 2

Resilience Optimization with Cross-Layer Techniques

VLSI-TSA, 2014; in submission, 2015

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection (this chapter) → Accelerator (FFT) Cross-Layer Protection]


Resilience Optimization with Cross-Layer Techniques

Overview

Motivation

– Common solutions for logic circuits (redundancy) are too expensive in area; hardening and parity are viable options

– Need to explore alternative architecture- and software-level solutions

• Can these provide efficient coverage for relevant circuits?

– Vulnerability and coverage need more accurate testing

Contributions

– Evaluate architecture-level Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT)

– Combine them with hardware and software techniques into cross-layer solutions for improved efficiency*

– Present results for two (simple and complex) architectures based on flip-flop-level fault injection*

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Methodology

Considered Protection Techniques
Hardware: Hardening/Parity*
Architecture: Data-Flow Checking (DFC)
Software: Algorithm-Based Fault Tolerance (ABFT)*

Platform Choice

Case Study 1: Leon [5]
– Scalar, in-order (IO)
– 7-stage pipeline, SPARC ISA
– 4 kB instruction & data caches
– Around 1,200 flip-flops

Case Study 2: Illinois Verilog Model (IVM) [6]
– Superscalar, out-of-order (OoO)
– 12-stage pipeline, Alpha ISA
– 8 kB instruction & 32 kB data caches
– Around 14,000 flip-flops

Protection Characterization
Implementation: cell library (hardening)*; implemented RTL (parity, DFC); created a compiler tool (DFC); developed software (ABFT)*
Analysis: area, power, and delay via synthesis; coverage via flip-flop fault injection*; runs on FPGA (Leon) or in RTL simulation (IVM)*; PERFECT and SPEC benchmarks

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Why Data-Flow Checking (DFC)?

[Figures: valid transitions between basic blocks in the binary on branch instructions; data-flow check in the Writeback (Leon) or Decode (IVM) stage via signature evaluation]

• DFC protects against errors in instruction flow, which are crucial to program execution and not easily masked

• DFC efficiently compresses a large amount of relevant pipeline state into a signature that is checked for every basic block (see the sketch below)

• Original hypothesis: DFC improves the efficiency of protection by reducing the need for complementary solutions (hardening and parity)
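As a software-level illustration only, here is a minimal Python sketch of signature-based control-flow checking in the spirit of DFC; the block names, CRC-based signatures, and control-flow graph are hypothetical stand-ins for what the compiler tool would derive from a real binary:

```python
# Minimal model of signature checking at basic-block boundaries.
import zlib

# "Static analysis": assign each basic block a signature and record the
# legal successors of each block (the control-flow graph).
static_sig = {bb: zlib.crc32(bb.encode()) for bb in ["entry", "loop", "exit"]}
legal_succ = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}

def check_transition(prev_bb: str, next_bb: str, runtime_sig: int) -> None:
    """At each block boundary, verify the accumulated run-time signature
    and that the taken transition exists in the static CFG."""
    if runtime_sig != static_sig[next_bb]:
        raise RuntimeError(f"signature mismatch entering {next_bb}")
    if next_bb not in legal_succ[prev_bb]:
        raise RuntimeError(f"illegal transition {prev_bb} -> {next_bb}")

# Normal execution passes; a fault that diverts entry -> exit is caught.
check_transition("entry", "loop", static_sig["loop"])
try:
    check_transition("entry", "exit", static_sig["exit"])
except RuntimeError as e:
    print("DFC detected:", e)
```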


Resilience Optimization with Cross-Layer Techniques

Algorithm-Based Fault Tolerance (ABFT)

Concept

• Uses algorithm properties that relate inputs to outputs to check the result (see the checksum sketch after the applications list)

• Some algorithms employ additional encoding/decoding for better detection (FFT)

Applications

• FFT1D Weak

• FFT1D Full

• FFT2D Weak

• FFT2D Full

• 2D Convolution*

• Motion Imaging*

• Wavelet Transform*

• Inner Product*

*In collaboration with Stanford University
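To show the kind of input/output property ABFT exploits, here is a minimal Python sketch of a checksum check for convolution (listed above as 2D convolution; shown in 1D for brevity, since the same identity holds): for a full convolution y = x * h, the sum of the outputs equals the product of the input sums. The function name and tolerance are illustrative assumptions:

```python
import numpy as np

def convolve_with_abft(x: np.ndarray, h: np.ndarray, tol: float = 1e-9):
    """Full convolution plus an ABFT checksum test:
    for y = x * h (full), sum(y) == sum(x) * sum(h)."""
    y = np.convolve(x, h, mode="full")
    if not np.isclose(y.sum(), x.sum() * h.sum(), rtol=tol):
        raise RuntimeError("ABFT checksum mismatch: recompute")
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, -1.0, 0.25])
y = convolve_with_abft(x, h)         # clean run passes

y_bad = y.copy()
y_bad[2] += 1e-3                     # simulated soft error in one output
assert not np.isclose(y_bad.sum(), x.sum() * h.sum())
```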


Resilience Optimization with Cross-Layer Techniques

Injection and Other Resilience Techniques

Flip-flop-Level Fault Injection*

• Determine vulnerability and detection via flip-flop-level fault injection

• Rank flip-flops according to vulnerability

• Allows designers to target error reduction via gradual protection of ranked flip-flops (see the sketch below)
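A toy Python stand-in for the injection campaign described above; the "model" and its four flip-flops are invented for illustration, but the flow is the same: flip one flip-flop bit at a random cycle, compare against a golden run, count unmasked (meaningful) errors, and rank flip-flops by vulnerability:

```python
import random

def run_model(inputs, inject=None):
    """Toy stand-in for an RTL simulation: four 'flip-flops' feed a serial
    computation. `inject` = (flip_flop_index, cycle) flips one bit for the
    rest of the run, mimicking a transient upset."""
    ffs = list(inputs)
    out = 0
    for cycle in range(8):
        if inject is not None and inject[1] == cycle:
            ffs[inject[0]] ^= 1                 # the particle strike
        out = (out << 1 | ffs[cycle % len(ffs)]) & 0xFF
    return out

golden = run_model([1, 0, 1, 1])
trials, mismatches = 200, {ff: 0 for ff in range(4)}
for ff in mismatches:
    for _ in range(trials):
        if run_model([1, 0, 1, 1], inject=(ff, random.randrange(8))) != golden:
            mismatches[ff] += 1                 # meaningful error, not masked

# Rank flip-flops by observed vulnerability; protect from the top down.
ranking = sorted(mismatches, key=mismatches.get, reverse=True)
print(ranking, mismatches)
```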

Hardening, Parity*

Hardening adds redundancy to flip-flop cells for better immunity

Parity adds a single check bit to a bit word to verify correctness on access, and is pipelined

[Figure: pipelined parity]

*In collaboration with Stanford University


Resilience Optimization with Cross-Layer Techniques

Ultimate Consideration: Local Recovery

Pipeline Flush
• Takes advantage of the built-in pipeline flush mechanism in the core
• Can recover only if the error is detected before reaching the memory stage

Recovery Unit (Parity)
• Buffers processor state and the register file for the duration of the pipeline length or more

Recovery Unit (DFC)
• Buffers pipeline state, register file, and memory stores for the (unbounded) basic-block duration

Significant overhead in Leon; IVM can utilize its Reorder Buffer (ROB) instead


Resilience Optimization with Cross-Layer Techniques

DFC Area

DFC Components
– Checker: computes the run-time signature and compares it against the static version
– Recovery controller: manages buffering and recovery
– Buffer: stores pipeline state and memory requests from a basic block (~15 entries max)
– Shadow register file: restores pipeline state

[Figures: Leon with DFC + recovery added; IVM with DFC + recovery added, overhead < 1% (~50x larger core, no recovery buffer or shadow register file needed)]


Resilience Optimization with Cross-Layer Techniques

DFC Performance

Dynamic basic-block structure

• Basic blocks are small because branch instructions are frequent

• Basic-block size: 6.3-10.6 instructions

• Blocks originating from indirect branches (12.3% by duration) are not protected by DFC

Dynamic signature overhead

• Leon allows utilizing unused delay slots

• Performance overhead: 9.2% (Leon), 10.8% (other ISAs); lower for superscalar cores

• Overhead can be decreased by embedding signatures in unused instruction bits


Resilience Optimization with Cross-Layer Techniques

DFC Coverage

• Most injected faults vanish: 79.3% (Leon), 91.4% (IVM); the remaining ones are meaningful

• Per-component vulnerability and protection vary more for IVM

• DFC provides better detection in IVM (detects 35% of meaningful errors) than in Leon (23%)

• Although detection per flip-flop is more intermittent than in Leon, IVM has a larger set of flip-flops that affect control flow when injected with a fault, and these are detectable by DFC

• The average vulnerability (SER) improvement of DFC is still relatively low:

                   Leon                          IVM
                   SER Improv.   Area Overhead   SER Improv.   Area Overhead
DFC                1.31x         27.3 (3.2)%     1.54x         0.16%
Hardening/Parity   5.00x         2.6%            5.00x         1.50%
ABFT               1.05-1.39x    ---             1.12-1.34x    ---


Resilience Optimization with Cross-Layer Techniques

Cross-Layer: Hardware-Architecture

Hardening + Parity: Area

• Overhead of hardening is lowered by adding parity where applicable

• Pipeline-flush recovery is needed to save area in Leon

• Efficient ROB recovery in IVM

Hardening/Parity + DFC: Area

• DFC increases overhead many-fold due to recovery; not feasible in Leon

• DFC is more efficient at higher SER targets

• The benefit is still marginal, within the variation range of synthesis tools


Resilience Optimization with Cross-Layer Techniques

Cross-Layer: Hardware-Architecture-Software

Hardening/Parity + ABFT: Area

• Combining ABFT with hardware techniques decreases overall overhead

• The benefit is more visible at lower SER targets

Hardening/Parity + DFC + ABFT: Area

• Evaluates the potential benefit of DFC at higher SER targets

• Combining DFC with hardening/parity and ABFT provides no benefit

• It increases overhead for all SER targets


Resilience Optimization with Cross-Layer Techniques

Chapter 2 Summary

• We developed Data-Flow Checking (DFC) and Algorithm-Based Fault Tolerance (ABFT) for cross-layer combinations with hardening/parity, while considering recovery options

• We show that the intermittent protection of DFC provides coverage below reasonable design targets, thus requiring additional protection and obviating the need for DFC (especially in Leon, due to expensive recovery)

• Hardening/parity proves to be the most efficient generic solution, while combining it with ABFT further improves efficiency for particular algorithms


Chapter 3

Cross-Layer Protection in an Accelerator

In preparation, 2015

[Roadmap: Existing Protection Overhead Characterization → Architecture-Level Cross-Layer Protection → Accelerator (FFT) Cross-Layer Protection (this chapter)]


Cross-Layer Protection in an Accelerator

Overview

Motivation

• A special-purpose accelerator is expected to be more vulnerable than a general-purpose core due to high utilization

• How much of the core area (flip-flops) needs to be protected for reasonable error-rate targets?

• An accelerator could potentially benefit from a hardware implementation of Algorithm-Based Fault Tolerance (ABFT)

• Can this approach compete with hardware-level solutions (hardening/parity/redundancy)?

Contributions

• Analyze the FFT accelerator, an interesting case study, with respect to vulnerability and the varying overhead of different designs

• Develop two versions of the hardware ABFT approach (weak and full) and analyze their overhead for different throughput capabilities

• Compare the efficiency of hardening/parity and ABFT (and their cross-layer combination) as well as redundancy


Cross-Layer Protection in an Accelerator

Methodology

Considered Protection Techniques
Logic: Hardening/Parity*, Redundancy, Algorithm-Based Fault Tolerance (ABFT)
Memory: Parity, ECC

Platform Choice
Case Study: FFT Accelerator [7]
– Asynchronous processing
– 1024-bit integer input
– 3 components
– 2 kB input & output arrays
– Around 500 flip-flops

Protection Characterization
Implementation: cell library (hardening*); developed RTL code (parity*, ABFT, redundancy)
Analysis: area, power, and delay via synthesis; coverage via flip-flop fault injection; runs in RTL simulation

*In collaboration with Stanford University


Cross-Layer Protection in an Accelerator

Why a Fast Fourier Transform (FFT) Accelerator?

[Figure: hardware units of various capability perform the computation by iterating through the butterfly diagram. Source: C. S. Burrus, FFT Flowgraphs, 2009 [8]]
[Figure: common 2x2 unit: processes 2 inputs with 1 complex multiplication and 2 complex additions. Source: J. A. Abraham, Fault-tolerant FFT networks, 1988 [9]]

• The FFT decomposes a time-domain function into a series of sinusoids

• One of the most popular accelerators, with a relatively complex structure

• Sensitive to errors due to data dependencies and sharing

• Has strong and weak ABFT algorithms, with a varying range of overheads (butterfly sketch below)
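To make the 2x2 unit's work concrete, here is a minimal Python sketch of the radix-2 butterfly it iterates over: one complex multiplication by the twiddle factor and two complex additions, matching the operation count cited above:

```python
import cmath

def butterfly(a: complex, b: complex, k: int, n: int):
    """Decimation-in-time radix-2 butterfly with twiddle factor W_n^k."""
    w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
    t = w * b                              # 1 complex multiplication
    return a + t, a - t                    # 2 complex additions

# With k=0, n=2 this is exactly the 2-point FFT of [1, 2]: (3, -1).
print(butterfly(1 + 0j, 2 + 0j, k=0, n=2))
```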


Cross-Layer Protection in an Accelerator

Algorithm-Based Fault Tolerance (ABFT)

[Figures: FFT result check comparing the first input point to the sum of the decoded output points; to maintain throughput, full ABFT requires 2 encoders and 2 decoders per 2x2 unit]

• ABFT checks only the end result of the multiple-point FFT butterfly computation; recovery is done via replay (see the sketch below)

• The full version requires encoders and decoders for each input to the 2x2 FFT unit; fewer can be used at a performance loss
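A minimal Python sketch of the weak check described above, relating the first input point to the sum of the output points: for the DFT, sum_k X[k] = N * x[0], so a corrupted output is flagged and the FFT replayed. The function name and tolerance are illustrative assumptions, not the hardware checker itself:

```python
import numpy as np

def fft_with_weak_abft(x: np.ndarray, tol: float = 1e-9) -> np.ndarray:
    """Weak ABFT check for the DFT: sum_k X[k] == N * x[0], i.e., the
    first input point is compared against the sum of the output points."""
    X = np.fft.fft(x)
    if not np.isclose(X.sum(), len(x) * x[0], atol=tol):
        raise RuntimeError("ABFT mismatch: replay the FFT")
    return X

x = np.array([3.0, 1.0, -2.0, 0.5])
X = fft_with_weak_abft(x)                # clean run passes

X[1] += 0.01                             # simulated soft error in one output
assert not np.isclose(X.sum(), len(x) * x[0])
```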


Cross-Layer Protection in an Accelerator

Injection and Other Resilience Techniques

Flip-flop-Level Fault Injection

• Similar fault-injection methodology and flip-flop vulnerability ranking as in Chapter 2

• Custom implementation of the fault injector in RTL, tailored to the FFT core

Hardening, Parity

Hardening adds redundancy to flip-flop cells for better immunity

Parity adds a single check bit to a bit word to verify correctness on access, and is pipelined

Redundancy

• The most reliable protection: replicates the entire core (sketched below)

• More than doubles the area (redundant copy + checker)

[Figures: pipelined parity; redundancy]
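A conceptual Python sketch of the redundancy scheme (dual modular redundancy with a comparator); in hardware the two copies run in lockstep on the same inputs, which this software model only approximates, since a deterministic function needs an injected fault to ever diverge:

```python
def dmr(compute, *args):
    """Run two copies of a unit and compare outputs; detection only, so a
    mismatch triggers replay (correction would need a third copy or rollback)."""
    a, b = compute(*args), compute(*args)
    if a != b:
        raise RuntimeError("redundancy checker: mismatch, replaying")
    return a

# Clean run: both copies agree and the result passes through the checker.
print(dmr(lambda x, y: x * y + 1, 6, 7))  # 43
```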


Cross-Layer Protection in an Accelerator

Resilience Area

FFT Core

• FFT core size varies with computation capability (half 2x2, full 2x2, two 2x2) and the required SRAM array (multiport for 2x2 and higher)

• A full-size implementation of the ABFT checker that maintains the throughput of the FFT core is larger than redundancy (red line)

[Figures: FFT core + resilience area (log scale); half 2x2 and full 2x2 FFT cores, normalized to unprotected]


Cross-Layer Protection in an Accelerator

ABFT Resilience Performance

FFT Core + Checker Configurations
– Half 2x2 unit (4 multipliers, 2-cycle latency, shared enc/dec)
– Full 2x2 unit (2x4 multipliers, 1-cycle latency, separate enc/dec)

[Figure: performance impact, duration of the 1024-bit sequence computation]


Cross-Layer Protection in an Accelerator

Individual Resilience Coverage

[Figures: distribution of injected faults, most of which are meaningful (output mismatch, OMM); protected vulnerable flip-flops and detected meaningful errors]

• Most injected faults cause meaningful errors (output mismatch) due to high utilization, unlike in general-purpose cores

• ABFT does not cover the RAM controller, which requires additional protection

• Weak ABFT detects 43% of errors, full ABFT 91%, and redundancy 99.9%


Cross-Layer Protection in an Accelerator

Cross-Layer Resilience Coverage

[Figures: half 2x2 and full 2x2 cores; full ABFT complemented with hardening/parity is still inefficient, even with reduced checker area; weak ABFT complemented with hardening/parity is optimal]

• Although its coverage is partial, weak ABFT has low overhead, which allows achieving optimal protection when complemented with hardening/parity

• Despite the reduced checker area in full ABFT, it still has a large overhead, unless a further major detriment to performance is acceptable (full 2x2 core)


Cross-Layer Protection in an Accelerator

Chapter 3 Summary

• We implemented ABFT protection for the FFT accelerator to compare against hardening/parity (and their cross-layer combinations) as well as redundancy

• We illustrate the increased vulnerability of the special-purpose FFT unit, which translates to a higher overhead for hardening/parity and improved ABFT efficiency

• We show that, although its coverage is partial, only weak ABFT provides optimal protection in combination with hardening/parity, due to its low overhead (unlike full ABFT, even with a reduced checker)


Dissertation Summary

In Chapter 1, we improve the understanding of the relative protection overheads of common resilience solutions and illustrate their efficiency in achieving comprehensive coverage

In Chapters 2 and 3, we prove the dissertation hypothesis by illustrating the benefits of combining hardening/parity with ABFT, using our cross-layer analysis and low-level fault injection

• The low coverage of DFC and its high recovery overhead (in Leon) make DFC uncompetitive against hardening/parity. Only the more efficient ABFT can lower the overhead of hardening/parity in an accelerator, which is more sensitive to overhead.


References

[1] R. Baumann et al. Soft Errors in Advanced Computer Systems. IEEE Design and Test of Computers, 22(3):258-266, 2005.
[2] N. Seifert et al. Soft Error Susceptibilities of 22 nm Tri-Gate Devices. IEEE Transactions on Nuclear Science, 59:2666-2673, 2012.
[3] N. George et al. Bit-slice Logic Interleaving for Spatial Multi-bit Soft-Error Tolerance. International Conference on Dependable Systems and Networks (DSN), 141-150, 2010.
[4] OpenCores. OpenRISC 1200. http://opencores.org/or1k/Main_Page. 2013.
[5] Aeroflex Gaisler. Leon3. http://www.gaisler.com/index.php/products/processors/leon3?task=view&id=13. 2014.
[6] N. J. Wang et al. Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline. International Conference on Dependable Systems and Networks (DSN), 61-70, 2004.
[7] DigitalFilter. FFT Lab. http://digitalfilter.com/enindex.html. 2014.
[8] C. S. Burrus. FFT Flowgraphs. http://cnx.org/content/m16352/1.11/. 2009.
[9] J. A. Abraham et al. Fault-tolerant FFT Networks. IEEE Transactions on Computers, 548-561, 1988.
[10] Wikipedia. Static Random-Access Memory; Flip-Flop; AND Gate. http://www.wikipedia.org/. 2014.


Publications

• L. G. Szafaryn, E. Cheng, C. Y. Cher, M. Stan, S. Mitra, and K. Skadron. "Hardware and Software Algorithm-based Resilience Tradeoffs in Accelerators." In preparation, 2015.

• L. G. Szafaryn, E. Cheng, H. Cho, C. Y. Cher, S. Mirkhani, M. Stan, K. Skadron, S. Mitra, and J. Abraham. "Cross-Layer Resilience Exploration." Parts submitted, prepared for another submission, 2015.

• S. Mitra, P. Bose, E. Cheng, C. Y. Cher, H. Cho, R. Joshi, Y. M. Kim, C. R. Lefurgy, Y. Li, K. P. Rodbell, K. Skadron, J. Stathis, and L. G. Szafaryn. "The Resilience Wall: Cross-Layer Solution Strategies." In the International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), April 2014.

• L. G. Szafaryn, B. Meyer, and K. Skadron. "Evaluating Overheads of Multibit Soft-Error Protection in the Processor Core." In IEEE Micro, August 2013.

• L. G. Szafaryn, T. Gamblin, B. R. de Supinski, and K. Skadron. "Trellis: Portability across Architectures with a High-level Framework." In the Journal of Parallel and Distributed Computing (JPDC), August 2013.

• L. G. Szafaryn, B. Meyer, and K. Skadron. "Evaluating Soft Error Protection Mechanisms in the Context of Multi-bit Errors at the Scope of a Processor." At SRC TECHCON, October 2012.

• L. G. Szafaryn, T. Gamblin, B. R. de Supinski, and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." In the Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), in conjunction with the 25th International Conference on Supercomputing (ICS), May 2011.

• S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. "Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads." In the IEEE International Symposium on Workload Characterization (IISWC), December 2010.

• M. Guevara, P. Wu, M. D. Marino, J. Meng, L. G. Szafaryn, P. Satyamoorthy, B. Meyer, K. Skadron, J. Lach, and B. Calhoun. "Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores." At SRC TECHCON, September 2010.

• L. G. Szafaryn, K. Skadron, and J. J. Saucerman. "Experiences Accelerating MATLAB Systems Biology Applications." In the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BiC), in conjunction with the 36th IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2009.

• L. G. Szafaryn, J. Saucerman, S. Acton, and K. Skadron. "A Tale of Two Systems Biology Applications: Experiences Accelerating Ultrasound Feature Tracking vs. Multi-Timescale Simulations." University of Virginia Dept. of Computer Science Technical Report CS-2009-01, March 2009.


Questions
