On Cosmic Rays, Bat Droppings, and what to do about them

David Walker
Princeton University
with Jay Ligatti, Lester Mackey, George Reis and David August
A Little-Publicized Fact
1 + 1 = 2 ... or, every so often, 1 + 1 = 3
How do Soft Faults Happen?

• "Solar particles": affect satellites; cause < 5% of terrestrial problems
• "Galactic particles": high-energy particles that penetrate to Earth's surface, through buildings and walls
• Alpha particles from bat droppings

A high-energy particle passes through a device and collides with a silicon atom. The collision generates an electric charge that can flip a single bit.
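To make the "flip a single bit" idea concrete, here is a minimal OCaml sketch (my own illustration, not from the talk): a single-event upset amounts to XOR-ing a stored word with a one-hot mask.

    (* My own sketch: flipping bit k of a stored word. *)
    let flip_bit (word : int) (k : int) : int = word lxor (1 lsl k)

    let () =
      (* bit 0 of 2 flips: 1 + 1 suddenly "equals" 3 *)
      Printf.printf "%d\n" (flip_bit (1 + 1) 0)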
How Often do Soft Faults Happen?

[Chart: IBM soft fail rate study, mainframes, 1983-86 [Zeiger-Puchner 2004].
Y-axis: city altitude in feet (0-12000), marking NYC, Tucson AZ, Denver CO, and
Leadville CO; X-axis: cosmic ray flux / fail rate (multiplier, 0-15). The fail
rate increases with altitude.]
Some Data Points:

• 83-86: Leadville (the highest incorporated city in the US): 1 fail / 2 days
• 83-86: subterranean experiment, under 50 ft of rock: no fails in 9 months
• 2004: 1 fail/year for a laptop with 1 GB RAM at sea level
• 2004: 1 fail per trans-Pacific roundtrip [Zeiger-Puchner 2004]
How Often do Soft Faults Happen?

[Chart: Soft Error Rate Trends [Shekhar Borkar, Intel, 2004]. X-axis: chip
feature size, from 180 nm down to 16 nm; Y-axis: relative soft error rate
increase (0-150). The trend is ~8% degradation/bit/generation; the slide marks
"we are approximately here" partway along the axis and "6 years from now"
toward the 16 nm end.]

• Soft error rates go up as:
  • voltages decrease
  • feature sizes decrease
  • transistor density increases
  • clock rates increase
... all future manufacturing trends.
How Often do Soft Faults Happen?

• In 1948, Presper Eckert noted that the cascading effects of a single-bit error destroyed hours of ENIAC's work [Zeiger-Puchner 2004].
• In 2000, Sun server systems deployed at America Online, eBay, and others crashed due to cosmic rays [Baumann 2002].
• "The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip" [Zeiger-Puchner 2004].
• Los Alamos National Lab's Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation [Michalak 2005].
What Problems do Soft Faults Cause?

• A single bit in memory gets flipped.
• A single bit in the processor logic gets flipped, and either:
  • there is no difference in externally observable behavior, or
  • the processor locks up, or
  • the computation is silently corrupted:
    • a register value is corrupted (simple data fault)
    • a control-flow transfer goes to the wrong place (control-flow fault)
    • a different opcode is interpreted (instruction fault)
FT Solutions

• Redundancy in Information
  • eg: error-correcting codes (ECC)
  • pros: protects stored values efficiently
  • cons: difficult to design for arithmetic units and control logic
• Redundancy in Space
  • multiple redundant hardware devices
  • eg: the Compaq NonStop Himalaya runs two identical programs on two processors, comparing pins on every cycle
  • pros: efficient in time
  • cons: expensive in hardware (double the space)
• Redundancy in Time (see the sketch below)
  • perform the same computations at different times (eg: in sequence)
  • pros: efficient in hardware (space is reused)
  • cons: expensive in time (slower --- but not twice as slow)
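A minimal OCaml sketch of redundancy in time (my own illustration): perform the same computation twice, in sequence, and compare the results; a mismatch signals a soft fault.

    (* Redundancy in time: the same work is redone later and compared. *)
    let compute_twice (f : unit -> int) : int =
      let r1 = f () in
      let r2 = f () in  (* second execution of the same computation *)
      if r1 = r2 then r1 else failwith "soft fault detected"

    let () = print_int (compute_twice (fun () -> 2 + 2))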
Solutions in Time

The compiler generates code containing replicated computations, fault-detection checks, and recovery routines.

• eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT and CRAFT [Reis et al. 05], ...
• pros: flexibility --- new code with better reliability properties may be deployed whenever, wherever needed
• cons: for a fixed reliability policy, slower than specialized hardware solutions
• cons: it might not actually work
“It might not actually work”
Agenda

• Answer basic scientific questions about software-controlled fault tolerance:
  • Do software-only or hybrid SW/HW techniques actually work?
  • For what fault models? How do we specify them?
  • How can we prove it?
• Build compilers that produce software that runs reliably on faulty hardware.
  • Moreover: let's not replace faulty hardware with faulty software. Let's prove every binary we produce is fault tolerant relative to the specified fault model.
A Possible Compiler Architecture

[Diagram: ordinary program → compiler front end → reliability transform →
fault-tolerant program → optimization → optimized FT program]

Testing requirements: all combinations of features, multiplied by all combinations of faults.
A More Reliable Compiler Architecture

[Diagram: ordinary program → compiler front end → reliability transform →
fault-tolerant program + reliability proof → optimization → optimized FT
program + modified proof → proof checker]

Testing requirements: all combinations of features, multiplied by all combinations of faults --- unless each binary carries a checkable reliability proof.
Central Technical Challenges

• Designing ...
Step 1: Lambda Zap

• Lambda Zap [ICFP 06]:
  • a lambda calculus that exhibits intermittent data faults, plus operators to detect and correct them
  • a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault
    • types act as the "proofs" of fault tolerance
    • expressive enough to implement an ordinary typed lambda calculus
• End result: the foundation for a fault-tolerant typed intermediate language
The Fault Model

• Lambda Zap models simple data faults only:

    ( M, F[ v1 ] )  --->  ( M, F[ v2 ] )

• Not modelled:
  • memory faults (better protected using ECC hardware)
  • control-flow faults (ie: faults during control-flow transfer)
  • instruction faults (ie: faults in instruction opcodes)
• Goal: construct programs that tolerate 1 fault
  • observers cannot distinguish between fault-free and 1-fault runs
Lambda to Lambda Zap: The main idea

Source program:

    let x = 2 in
    let y = x + x in
    out y

Translation --- replicate the instructions, and make each output an atomic majority vote:

    let x1 = 2 in
    let x2 = 2 in
    let x3 = 2 in
    let y1 = x1 + x1 in
    let y2 = x2 + x2 in
    let y3 = x3 + x3 in
    out [y1, y2, y3]
Lambda to Lambda Zap: The main idea

Now suppose a fault corrupts one copy (x3 becomes 7):

    let x1 = 2 in
    let x2 = 2 in
    let x3 = 7 in
    let y1 = x1 + x1 in
    let y2 = x2 + x2 in
    let y3 = x3 + x3 in
    out [y1, y2, y3]

The corrupted value is copied and percolates through the computation (y3 becomes 14), but the final output is unchanged: the majority vote over [y1, y2, y3] still yields 4. (A runnable sketch of this follows.)
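The translation can be mimicked in plain OCaml --- a hedged sketch only, with `vote` and `out` as my stand-ins for the calculus's atomic majority-vote output:

    (* Majority vote over a triple; fails only if no two copies agree. *)
    let vote (a, b, c) =
      if a = b || a = c then a
      else if b = c then b
      else failwith "no consensus: more than one fault"

    let out triple = print_int (vote triple)

    let () =
      let x1 = 2 and x2 = 2 and x3 = 7 in          (* x3 corrupted by a fault *)
      let y1 = x1 + x1 and y2 = x2 + x2 and y3 = x3 + x3 in
      out (y1, y2, y3)                             (* still prints 4 *)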
Lambda to Lambda Zap: Control-flow

    let x = 2 in
    if x then e1 else e2

translates (recursively translating the subexpressions e1 and e2) to:

    let x1 = 2 in
    let x2 = 2 in
    let x3 = 2 in
    if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]

with a majority vote on each control-flow transfer. (Function calls replicate arguments, results, and the function itself.)
Almost too easy --- can anything go wrong?

Yes! Optimization reduces the replication overhead dramatically (eg: ~43% for 2 copies), but it can be unsound: the original implementation of SWIFT [Reis et al.] optimized away all the redundancy, leaving an unreliable implementation!
Faulty Optimizations

    let x1 = 2 in
    let x2 = 2 in
    let x3 = 2 in
    let y1 = x1 + x1 in
    let y2 = x2 + x2 in
    let y3 = x3 + x3 in
    out [y1, y2, y3]

Common subexpression elimination (CSE) rewrites this to:

    let x1 = 2 in
    let y1 = x1 + x1 in
    out [y1, y1, y1]

In general, optimizations eliminate redundancy; fault tolerance requires redundancy. (A sketch of the resulting failure follows.)
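To see the unsoundness concretely (my own OCaml illustration): after CSE, all three vote inputs flow from the single copy y1, so a fault in y1 wins the vote unanimously and is never detected.

    let vote (a, b, c) =
      if a = b || a = c then a
      else if b = c then b
      else failwith "no consensus"

    let () =
      let y1 = 14 in                  (* fault: should have been 4 *)
      print_int (vote (y1, y1, y1))   (* prints the corrupted 14 *)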
The Essential Problem

Bad code --- the voters depend on a common value, x1:

    let x1 = 2 in
    let y1 = x1 + x1 in
    out [y1, y1, y1]

Good code --- the voters do not depend on a common value (red depends only on red, green only on green, blue only on blue):

    let x1 = 2 in
    let x2 = 2 in
    let x3 = 2 in
    let y1 = x1 + x1 in
    let y2 = x2 + x2 in
    let y3 = x3 + x3 in
    out [y1, y2, y3]
A Type System for Lambda Zap

• Key idea: types track the "color" of the underlying value and prevent interference between colors. (A rough analogy in OCaml follows.)

    Colors  C ::= R | G | B
    Types   T ::= C int | C bool | C (T1, T2, T3) -> (T1', T2', T3')
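As a rough analogy only (mine, using OCaml phantom types, not the paper's type theory): a color-indexed value type lets the checker insist that + stays within one color while out consumes one value of each. This gestures at non-interference but omits the paper's equivalence tracking between copies.

    type red
    type green
    type blue

    type 'c cval = V of int   (* an int tagged with a phantom color 'c *)

    (* addition preserves a single color: both operands share 'c *)
    let add (V a : 'c cval) (V b : 'c cval) : 'c cval = V (a + b)

    (* out demands one value of each distinct color *)
    let out (V r : red cval) (V g : green cval) (V b : blue cval) : unit =
      let vote x y z = if x = y || x = z then x else y in
      print_int (vote r g b)

    let () = out (V 4) (V 4) (V 14)   (* a corrupted blue copy is outvoted *)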
Sample Typing Rules

Judgement form:   G |--z e : T     where z ::= C | .

Simple value typing rules:

    (x : T) in G
    -------------
    G |--z x : T

    ------------------
    G |--z C n : C int

    ----------------------
    G |--z C true : C bool

Sample expression typing rules:

    G |--z e1 : C int     G |--z e2 : C int
    ----------------------------------------
    G |--z e1 + e2 : C int

    G |--z e1 : R int     G |--z e2 : G int     G |--z e3 : B int     G |--z e4 : T
    --------------------------------------------------------------------------------
    G |--z out [e1, e2, e3]; e4 : T

    G |--z e1 : R bool     G |--z e2 : G bool     G |--z e3 : B bool
    G |--z e4 : T          G |--z e5 : T
    -----------------------------------------------------------------
    G |--z if [e1, e2, e3] then e4 else e5 : T
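A hedged OCaml rendering of the flavor of two of these rules (the names and the simplification of the judgement to checks over types are mine): + demands one shared color, while out demands one operand of each of R, G, B.

    type color = R | G | B
    type ty = TInt of color

    let check_add (t1 : ty) (t2 : ty) : ty option =
      match t1, t2 with
      | TInt c1, TInt c2 when c1 = c2 -> Some (TInt c1)  (* e1 + e2 : C int *)
      | _ -> None

    let check_out (t1 : ty) (t2 : ty) (t3 : ty) : bool =
      match t1, t2, t3 with
      | TInt R, TInt G, TInt B -> true   (* out [e1, e2, e3] is well-colored *)
      | _ -> false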
Sample Typing Rules

Judgement form:   G |--z e : T     where z ::= C | .

Recall the "zap rule" from the operational semantics:

    ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Before the fault we have |-- v1 : C U, with no conditions; how will we obtain type preservation once v1 becomes an arbitrary v2? After the fault, v2 is typed under the faulty mode C, by the rule:

    -------------------
    G |--C C v : C U

"Faulty typing" occurs within a single color only: once the judgement is marked with color C, any value of color C is accepted, so a fault never leaks across colors.
Theorems [ICFP 06]

• Theorem 1: Well-typed programs are safe, even when there is a single error.
• Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat].
• Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap [that satisfies the caveat].
The Caveat

Goal: 0-fault and 1-fault executions should be indistinguishable.

Bad, but well-typed code:

    out [2, 3, 3]    outputs 3 after no faults
    out [2, 2, 3]    outputs 2 after 1 fault

More importantly:
• out [2, 3, 3] is obviously a symptom of a compiler bug
• out [2, 3, 4] is even worse --- good runs never come to consensus

Solution: the replicated computations must be independent, but equivalent.
The Caveat

Modified typing:

    G |--z e1 : R U     G |--z e2 : G U     G |--z e3 : B U
    G |--z e4 : T       G |--z e1 ~~ e2     G |--z e2 ~~ e3
    --------------------------------------------------------
    G |-- out [e1, e2, e3]; e4 : T
The Caveat

More generally, programmers may form "triples" of equivalent values.

Introduction form:  [e1, e2, e3]
• a collection of 3 items
• each of the 3 stored in a separate register
• a single fault affects at most one

Elimination form:  let [x1, x2, x3] = e1 in e2
The Caveat

More generally, programmers may form "triples" of equivalent values.

Introduction form:

    G |--z e1 : R U     G |--z e2 : G U     G |--z e3 : B U
    G |--z e1 ~~ e2     G |--z e2 ~~ e3
    --------------------------------------------------------
    G |-- [e1, e2, e3] : [R U, G U, B U]

Elimination form:

    G |--z e1 : [R U, G U, B U]
    G, x1:R U, x2:G U, x3:B U, x1 ~ x2, x2 ~ x3 |--z e2 : T
    --------------------------------------------------------
    G |-- let [x1, x2, x3] = e1 in e2 : T
Theorems*

• Theorem 1: Well-typed programs are safe, even when there is a single error.
• Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.
• Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap.

* There is still one "i" to be dotted in the proofs of these theorems. Lester Mackey, a brilliant Princeton undergrad, has proven all the key theorems modulo the dotted "i".
Step 2: Fault Tolerant Typed Assembly Language (TAL/FT)

• Lambda zap is a playground for studying the principles of fault tolerance in an idealized setting.
• TAL/FT is a more realistic assembly-level, hybrid HW/SW fault tolerance scheme, with:
  (1) a formal fault model
  (2) a formal definition of fault tolerance relative to memory-mapped I/O
  (3) a sound type system for proving compiled programs are fault tolerant
A More Reliable Compiler Architecture

[Diagram: ordinary program → compiler front end → reliability transform →
TAL/FT program, whose types serve as the reliability proof → optimization →
optimized TAL/FT, whose types serve as the modified proof → type checker
acting as the proof checker]
TAL/FT: Key Ideas (Fault Model)

• Fault model:
  • registers may incur arbitrary faults in between the execution of any two instructions
  • memory (including code) is protected by ECC
  • the fault model is formalized as part of the hardware operational semantics
TAL/FT: Key Ideas (Properties)

[Diagram: the processor reads from ECC-protected memory and issues stores to a memory-mapped I/O device.]

Primary goal: if there is one fault, then either
(1) the memory-mapped I/O device sees exactly the same sequence of stores as a fault-free execution, or
(2) the hardware detects and signals a fault, and memory-mapped I/O sees a prefix of the stores from a fault-free execution.

Secondary goal: no false positives. (A sketch of the primary goal as a predicate follows.)
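The primary goal can be phrased as a predicate over store traces. A minimal OCaml sketch under my own simplifications (stores as list elements, the hardware's fault signal as a boolean):

    (* is_prefix p q: does trace p appear as a prefix of trace q? *)
    let rec is_prefix p q =
      match p, q with
      | [], _ -> true
      | x :: p', y :: q' -> x = y && is_prefix p' q'
      | _ :: _, [] -> false

    (* An execution is acceptable if the device saw the fault-free trace
       exactly, or a fault was signalled and it saw a prefix of it. *)
    let acceptable ~fault_free ~observed ~fault_signalled =
      observed = fault_free
      || (fault_signalled && is_prefix observed fault_free)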
TAL/FT: Key Ideas (Mechanisms)

• Compiler strategy
  • create two redundant computations, as in lambda zap
    • two copies ==> fault detection
    • fault recovery is handled by a higher-level process
• Hardware support
  • special instructions and a modified store buffer for implementing reliable stores
  • special instructions for reliable control-flow transfers
• Type system
  • simple typing mechanisms based on the original TAL [Morrisett, Walker, et al.]
  • redundant values with separate colors, as in lambda zap
  • value identities needed for equivalence checking are tracked using singleton types combined with some ideas drawn from traditional Hoare logics
Current & Future Work

• Build the first compiler that can automatically generate reliability proofs for compiled programs
  • TAL/FT refinement and implementation
  • type- and reliability-preserving optimizations
• Study alternative fault detection schemes
  • fault detection & recovery on current hardware
  • exploit multi-core alternatives
• Understand the foundational theoretical principles that allow programs to tolerate transient faults
  • general-purpose program logics for reasoning about faulty programs
Other Research I Do

• PADS [popl 06, sigmod 06 demo, popl 07]
  • automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for "ad hoc" data formats
  • with Kathleen Fisher (AT&T)
• Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]
  • semantics, design, and implementation of programs that monitor other programs for security (or other purposes)
• TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ...]
  • theory, design, and implementation of type systems for compiler target and intermediate languages
Conclusions

Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10+ years out).

Using proofs and types, I think we are going to be able to develop highly reliable software that runs on unreliable hardware.
end!
The Caveat

(The operational semantics for functions follow.)
Lambda to Lambda Zap: Control-flow

    let f = \x.e in
    f 2

translates to:

    let [f1, f2, f3] = \x. [[ e ]] in
    [f1, f2, f3] [2, 2, 2]

with a majority vote on the control-flow transfer. Operational semantics:

    (M; let [f1, f2, f3] = \x.e1 in e2)
      ---> (M, l = \x.e1; e2[ l / f1 ][ l / f2 ][ l / f3 ])
TAL/FT Hardware

[Diagram: the processor gains replicated program counters and a store queue, backed by ECC-protected caches/memory.]
Related Work Follows
Software Mitigation Techniques

• Examples:
  • N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], ...
  • hybrid hardware-software techniques: watchdog processors, CRAFT [Reis et al. 2005], ...
• Pros:
  • immediate deployment: if your system is suffering soft-error-related failures, you may deploy new software immediately
    • would have benefitted Los Alamos Labs, etc.
  • policies may be customized to the environment and application
  • reduced hardware cost
• Cons:
  • for the same universal policy, slower (but not as much as you'd think)
  • IT MIGHT NOT ACTUALLY WORK!
Mitigation Techniques

Hardware:
• error-correcting codes
• redundant hardware
• Pros:
  • fast for a fixed policy
• Cons:
  • FT policy decided at hardware design time
    • mistakes cost millions
  • one-size-fits-all policy
  • expensive

Software and hybrid schemes:
• replicate computations
• Pros:
  • immediate deployment
  • policies customized to environment, application
  • reduced hardware cost
• Cons:
  • for the same universal policy, slower (but not as much as you'd think)
  • it may not actually work!
    • much research in the HW/compilers community is completely lacking proof
Solutions in Time

• Solutions in Hardware
  • replication of instructions and checking implemented in special-purpose hardware
  • eg: Reinhardt & Mukherjee 2000
  • pros: transparent to software
  • cons: one-size-fits-all reliability policy
  • cons: can't fix an existing problem; specialized hardware has a reduced market
• Solutions in Software (or hybrid Hardware/Software)
  • compiler generates replicated instructions and checking code
  • eg: Reis et al. 05
  • pros: flexibility: new reliability policies may be deployed whenever needed
  • cons: for a fixed reliability policy, slower than specialized hardware solutions
  • cons: it might not actually work