AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
Eric Rotenberg
Computer Sciences Department
University of Wisconsin — Madison
http://www.cs.wisc.edu/~ericro/ericro.html
ericro@cs.wisc.edu
High-Performance Microprocessors
• General-purpose computing success
- Proliferation of high-performance single-chip
microprocessors
• Rapid performance growth fueled by two trends
- Technology advances: circuit speed and density
- Microarchitecture innovations: exploiting instruction-level
parallelism (ILP) in sequential programs
• Both technology and microarchitecture performance trends
have implications for fault tolerance
Technology and Fault Tolerance
• Technology-driven performance improvements
- GHz clock rates
- Billion-transistor chips
• High clock rate, dense designs
- low voltages for fast switching and power management
- high-performance and “undisciplined” circuit techniques
- managing clock skew with GHz clocks
- pushing the technology envelope potentially reduces
design tolerances in general
=> Entire chip prone to frequent, arbitrary transient faults
Microarchitecture and Fault Tolerance
• Conventional fault-tolerant techniques
- Specialized techniques (e.g. ECC for memory, RESO for
ALUs) do not cover arbitrary logic faults
- Pervasive self-checking logic is intrusive to design
- System-level fault tolerance (e.g. redundant processors)
too costly for commodity computers
• A microarchitecture-based fault-tolerant approach
- Microarchitecture performance trends can be easily
leveraged for fault-tolerance goals
- Broad coverage of transient faults
- Low overhead: performance, area, and design changes
Talk Outline
✓ Technology and microarchitecture trends: implications for
fault tolerance
• Time redundancy spectrum
• AR-SMT
• Leveraging microarchitecture trends
- Simultaneous Multithreading (SMT)
- Speculation concepts
- Hierarchical processors (e.g. Trace Processors)
• Performance results
• Summary and future work
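Time Redundancy Spectrum
[Figure: two ends of the time redundancy spectrum. Program-level time redundancy: Program P and its copy P’ each run in full on a processor. Instruction re-execution: inside one processor with dynamic scheduling and parallel execution units, instruction i of Program P is issued twice (i and i’) to the functional units (FU).]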
Time Redundancy Spectrum
[Figure: AR-SMT time redundancy. Program P and its redundant copy P’ run together on a single processor.]
AR-SMT High Level
[Figure: block diagram. The A-stream and the R-stream both flow through one PROCESSOR, from fetch to commit, and a DELAY BUFFER sits between the two streams.]
• “A” => “Active stream”
• “R” => “Redundant stream”
• “SMT” => “Simultaneous MultiThreading”
AR-SMT Fault Tolerance
• Delay Buffer
- Simple, fast, hardware-only state passing for comparing
thread state
- Ensures time redundancy: the A- and R-stream copies of
an instruction execute at different times
- Buffer length adjusted to cover transient fault lifetimes
• Transient fault detection and recovery
- Fault detected when thread state does not match
- Error latency related to length of Delay Buffer
- Committed R-stream state is checkpoint for recovery
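
The Delay Buffer behaviour listed above can be sketched in software. This is only an illustrative model, not the hardware design from the talk: the class `DelayBuffer`, the function `run`, the `faulty_a_index` parameter, and the choice to compare a (pc, value) pair per instruction are all assumptions made for the example.

```python
from collections import deque


class DelayBuffer:
    """FIFO of committed A-stream results awaiting comparison by the R-stream."""

    def __init__(self, length):
        self.length = length
        self.entries = deque()

    def full(self):
        return len(self.entries) >= self.length

    def push(self, pc, value):
        # A-stream commit: record (pc, result) for later checking.
        if self.full():
            raise RuntimeError("A-stream must stall: Delay Buffer is full")
        self.entries.append((pc, value))

    def check(self, pc, value):
        # R-stream commit: compare against the oldest A-stream entry.
        # Returns True if the two copies agree, False on a detected fault.
        return self.entries.popleft() == (pc, value)


def run(program, length, faulty_a_index=None):
    """Run the A-stream ahead of the R-stream.

    program: list of zero-argument callables, one per instruction.
    faulty_a_index: hypothetical fault injection point in the A-stream.
    Returns the index of the first mismatching instruction, or None.
    """
    buf = DelayBuffer(length)
    a_next = r_next = 0
    while r_next < len(program):
        # The A-stream runs ahead until the buffer fills or the program ends.
        while a_next < len(program) and not buf.full():
            value = program[a_next]()
            if a_next == faulty_a_index:
                value ^= 1  # flip one bit to model a transient fault
            buf.push(a_next, value)
            a_next += 1
        # The R-stream re-executes the oldest outstanding instruction later.
        if not buf.check(r_next, program[r_next]()):
            return r_next  # fault detected; R-stream state is the checkpoint
        r_next += 1
    return None
```

In this model the buffer length bounds how far the A-stream can run ahead, which is also the distance in instructions between a fault and its detection.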
Leveraging Microarchitecture Trends
• Simultaneous Multithreading
• Control Flow and Data Flow Prediction
• Hierarchy and Replication
SMT
• Simultaneous multithreading
[Tullsen,Eggers,Emer,Levy,Lo,Stamm: ISCA-23]
- Multiple contexts space-share a wide-issue superscalar
processor
- Key ideas
• Performance: better overall utilization of highly parallel
uniprocessor
• Complexity: leverage well-understood superscalar
mechanisms to make multiple contexts transparent
• SMT is (most likely) in next-generation product plans
Control and Data “Prediction”
(a) program segment
    i1: r1 <= r6-11
    i2: r2 <= r1<<2
    i3: r3 <= r1+r2
    i4: branch i5,r3==0
    i5: r5 <= 10
(b) base schedule
    cycle 1: i1
    cycle 2: i2
    cycle 3: i3
    cycle 4: i4
    cycles 5-8: pipeline latency
    cycle 9: i5
(c) with prediction
    cycle 1: i1 i2 i3 i4 i5
• Delay Buffer contains perfect “predictions” for R-stream!
• Existing prediction-validation hardware provides fault detection!
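
The two schedules in the figure can be reproduced with a small model. It is a sketch under stated assumptions: every instruction takes one cycle, issue width is unlimited, and `PIPELINE_LATENCY` (the bubble between a resolved branch and its target) is set to 4 to match the figure. The names `PROGRAM` and `schedule` are illustrative only.

```python
# Each entry: (name, data producers it waits for, branch it is control-dependent on)
PROGRAM = [
    ("i1", [], None),            # r1 <= r6-11
    ("i2", ["i1"], None),        # r2 <= r1<<2 reads r1
    ("i3", ["i1", "i2"], None),  # r3 <= r1+r2 reads r1 and r2
    ("i4", ["i3"], None),        # branch reads r3
    ("i5", [], "i4"),            # fetched only once branch i4 resolves
]

PIPELINE_LATENCY = 4  # assumed bubble cycles after a resolved branch


def schedule(program, predict):
    """Return {instruction name: issue cycle}.

    With predict=True every data value and branch outcome is assumed to be
    supplied in advance, so no instruction waits on another.
    """
    cycle = {}
    for name, data_deps, control_dep in program:
        start = 1
        if not predict:
            for dep in data_deps:
                start = max(start, cycle[dep] + 1)
            if control_dep is not None:
                start = max(start, cycle[control_dep] + 1 + PIPELINE_LATENCY)
        cycle[name] = start
    return cycle
```

For the R-stream the "predictions" are the A-stream's own results read from the Delay Buffer, so they are wrong only when one of the two executions was faulty. A validation mismatch can therefore be treated as a detected fault rather than an ordinary misprediction.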
Trace Processors
• Trace: a long (16 to 32) dynamic sequence of instructions, and the
fundamental unit of operation in Trace Processors
• Replicated PEs => inherent, coarse level of hardware redundancy
- Permanent fault coverage: A-/R-stream traces go to different PEs
- Simple re-configuration: remove a PE from the resource pool
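
The two PE-level ideas above, sending the A-stream and R-stream copies of a trace to different PEs and removing a PE from the pool, can be sketched as a small allocator. The class `PEPool`, its method names, and the round-robin policy are assumptions for illustration; the slides do not specify an allocation policy.

```python
class PEPool:
    """Tracks usable processing elements (PEs) and which PE ran each trace."""

    def __init__(self, num_pes):
        self.usable = list(range(num_pes))
        self.a_stream_pe = {}  # trace id -> PE that ran the A-stream copy
        self.next_slot = 0

    def _next_pe(self, avoid=None):
        # Round-robin over usable PEs, skipping the PE to avoid (if any).
        for _ in range(len(self.usable)):
            pe = self.usable[self.next_slot % len(self.usable)]
            self.next_slot += 1
            if pe != avoid:
                return pe
        raise RuntimeError("not enough usable PEs")

    def allocate_a(self, trace_id):
        pe = self._next_pe()
        self.a_stream_pe[trace_id] = pe
        return pe

    def allocate_r(self, trace_id):
        # The redundant copy must not reuse the A-stream's PE, so a
        # permanent fault in one PE cannot corrupt both copies identically.
        return self._next_pe(avoid=self.a_stream_pe.pop(trace_id))

    def retire_pe(self, pe):
        # Re-configuration: drop a PE suspected of a permanent fault.
        self.usable.remove(pe)
```

A pool built with `PEPool(4)` keeps working after `retire_pe(1)`, at the cost of one fewer PE for both streams.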
[Figure: trace processor block diagram. Four processing elements (Processing Element 0 to 3), with labels for global registers, predicted registers, local registers, issue buffers, and functional units. Shared structures: next-trace predictor, trace cache, live-in value predictor, global rename maps, speculative state, and data cache.]
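AR-SMT on a Trace Processor
[Figure: AR-SMT mapped onto the trace processor pipeline stages FETCH, DISPATCH, ISSUE, EXECUTE, and RETIRE. Fetch: trace predictor (A-stream), trace "prediction" from the Delay Buffer (R-stream), and trace cache. Dispatch: decode, rename, allocate PE. Issue and execute: PEs assigned to the A-stream or the R-stream, with the register file, D-cache, and disambiguation unit. Retire: reclaim resources and commit state. Resources are marked as either time-shared or space-shared.]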
Performance Results
[Figure: bar chart of normalized execution time (single thread = 1), y-axis from 1 to 1.3, for 4 PE and 8 PE configurations on the benchmarks comp, gcc, go, jpeg, and li.]
Performance Results
[Figure: distribution of the number of PEs used (0 to 8) by the A-stream and by the R-stream, as a percentage of all cycles (y-axis from 0 to 30).]
Summary
• Technology-driven performance improvements
- New fault environment: frequent, arbitrary transient faults
• Leverage microarchitecture performance trends for broad-coverage, low-overhead fault tolerance
- SMT-based time redundancy
- Control and data “prediction”
- Hierarchical processors
• Introducing a second, redundant thread increases execution
time by only 10% to 30%
Other Interesting Perspectives
• D. Siewiorek. “Niche Successes to Ubiquitous Invisibility:
Fault-Tolerant Computing Past, Present, and Future”,
FTCS-25
- (Quote) Fault-tolerant architectures have not kept pace
with the rate of change in commercial systems.
- Fault tolerance must make unconventional in-roads into
commodity processors: leverage the commodity
microarchitecture.
• P. Rubinfeld. “Managing Problems at High Speeds”, Virtual
Roundtable on the Challenges and Trends in Processor
Design, Computer, Jan. 1998.
- Implications of very high clock rate, dense designs
Future Work
• Performance and design studies
- Isolate individual performance impact of SMT-ness and
prediction aspects
- Delay Buffer size vs. performance: SMT scheduling
flexibility
- Support more threads
- Design and validate a full AR-SMT fault-tolerant system
• Derive fault models (collaboration?)
• Simulate fault coverage
- fault injection
- test different parts of processor: coverage varies?
- vary duration of transient faults
- permanent faults and re-configuration
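
A first fault-injection experiment of the kind listed above could vary the fault duration against the A-to-R delay. The sketch below is a deliberately pessimistic toy model, not a result from the talk: it assumes one instruction per cycle, a fixed delay in cycles between the two copies, and that a fault corrupting both copies of an instruction goes undetected. All names (`classify`, `count_escapes`, `fault_start`, `horizon`) are hypothetical.

```python
def classify(t_a, delay, fault_start, fault_len):
    """Classify one instruction whose A copy runs at t_a and R copy at t_a + delay."""

    def hit(t):
        # True if cycle t falls inside the transient fault window.
        return fault_start <= t < fault_start + fault_len

    a_hit, r_hit = hit(t_a), hit(t_a + delay)
    if a_hit and r_hit:
        return "possibly undetected"  # worst case: both copies corrupted alike
    if a_hit or r_hit:
        return "detected"             # the two copies disagree
    return "unaffected"


def count_escapes(delay, fault_len, fault_start=100, horizon=1000):
    """Count instructions whose two copies both fall inside the fault window."""
    return sum(
        classify(t, delay, fault_start, fault_len) == "possibly undetected"
        for t in range(horizon)
    )
```

Under these assumptions no instruction escapes once the delay is at least as long as the fault, which is one way to read "buffer length adjusted to cover transient fault lifetimes".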