On Cosmic Rays, Bat Droppings, and What to Do About Them

David Walker, Princeton University
with Jay Ligatti, Lester Mackey, George Reis, and David August

A Little-Publicized Fact

1 + 1 = 23

How do Soft Faults Happen?

"Solar particles" affect satellites, but cause less than 5% of terrestrial problems. "Galactic particles" are high-energy particles that penetrate to the Earth's surface, through buildings and walls. Alpha particles come from bat droppings. A high-energy particle passes through a device and collides with a silicon atom; the collision generates an electric charge that can flip a single bit.

How Often do Soft Faults Happen?

IBM soft fail rate study; mainframes; 1983-86 [Ziegler-Puchner 2004]

[Chart: cosmic ray flux / fail rate (multiplier, 0 to 15) against altitude (0 to 12,000 feet), with data points for NYC (sea level), Tucson, AZ, Denver, CO, and Leadville, CO]

Some data points [Ziegler-Puchner 2004]:
- 1983-86: Leadville, CO (the highest incorporated city in the US): 1 fail every 2 days
- 1983-86: subterranean experiment under 50 feet of rock: no fails in 9 months
- 2004: 1 fail/year for a laptop with 1GB of RAM at sea level
- 2004: 1 fail per trans-Pacific round trip

Soft error rate trends [Shekhar Borkar, Intel, 2004]:

[Chart: relative soft error rate increase against chip feature size (180, 130, 90, 65, 45, 32, 22, 16 nm), showing ~8% degradation per bit per generation, with markers "we are approximately here" and "6 years from now"]

Soft error rates go up as:
- voltages decrease
- feature sizes decrease
- transistor density increases
- clock rates increase
All of these are future manufacturing trends.

In 1948, Presper Eckert noted that the cascading effects of a single-bit error destroyed hours of ENIAC's work. [Ziegler-Puchner 2004]
In 2000, Sun server systems deployed at America Online, eBay, and others crashed due to cosmic rays. [Baumann 2002]
"The wake-up call came at the end of 2001 ... [a] billion-dollar factory ground to a halt every month due to ... a single bit flip." [Ziegler-Puchner 2004]
Los Alamos National Lab's Hewlett-Packard ASC Q, a 2048-node supercomputer, was crashing regularly from soft faults due to cosmic radiation. [Michalak 2005]

What Problems do Soft Faults Cause?
Either:
- a single bit in memory gets flipped, or
- a single bit in the processor logic gets flipped, and then:
  - there is no difference in externally observable behavior, or
  - the processor locks up, or
  - the computation is silently corrupted:
    - a register value is corrupted (a simple data fault)
    - a control-flow transfer goes to the wrong place (a control-flow fault)
    - a different opcode is interpreted (an instruction fault)

FT Solutions

Redundancy in information:
- e.g., error-correcting codes (ECC)
- pros: protects stored values efficiently
- cons: difficult to design for arithmetic units and control logic

Redundancy in space:
- multiple redundant hardware devices; e.g., the Compaq NonStop Himalaya runs two identical programs on two processors, comparing pins on every cycle
- pros: efficient in time
- cons: expensive in hardware (double the space)

Redundancy in time:
- perform the same computations at different times (e.g., in sequence)
- pros: efficient in hardware (space is reused)
- cons: expensive in time (slower --- but not twice as slow)

Solutions in Time

The compiler generates code containing replicated computations, fault detection checks, and recovery routines; e.g., Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT and CRAFT [Reis et al. 05], ...
- pros: flexibility --- software-controlled, so new code with better reliability properties may be deployed whenever, wherever needed
- cons: for a fixed reliability policy, slower than specialized hardware solutions
- cons: it might not actually work
A sketch of the kind of code such a transform emits appears below.
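As a concrete illustration, here is a minimal OCaml sketch, written for this text rather than taken from any of the cited systems, of duplicate-and-compare code in the "redundancy in time" style. The names out_checked and Fault_detected are hypothetical.

  (* Duplicate-and-compare: each value is computed twice, and the two
     copies are compared before any externally visible output. *)

  exception Fault_detected

  let out_checked (v1 : int) (v2 : int) : unit =
    if v1 = v2 then (print_int v1; print_newline ())  (* copies agree: safe *)
    else raise Fault_detected                         (* a fault hit one copy *)

  let () =
    (* source program: let x = 2 in let y = x + x in out y *)
    let x1 = 2 in                (* copy 1 *)
    let x2 = 2 in                (* copy 2, computed independently *)
    let y1 = x1 + x1 in
    let y2 = x2 + x2 in
    out_checked y1 y2            (* check, then output *)

Recovery routines are not modeled here; a real transform would also have to protect the comparison itself.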
"It might not actually work"

Agenda

Answer basic scientific questions about software-controlled fault tolerance:
- Do software-only or hybrid SW/HW techniques actually work?
- For what fault models? How do we specify them?
- How can we prove it?
Build compilers that produce software that runs reliably on faulty hardware.
Moreover: let's not replace faulty hardware with faulty software --- let's prove every binary we produce is fault tolerant relative to the specified fault model.

A Possible Compiler Architecture

compiler front end --> ordinary program --> reliability transform --> fault-tolerant program --> optimization --> optimized FT program

Testing requirements: all combinations of features, multiplied by all combinations of faults.

A More Reliable Compiler Architecture

compiler front end --> ordinary program --> reliability transform --> fault-tolerant program + reliability proof --> optimization --> optimized FT program + modified proof --> proof checker

Central Technical Challenges

Step 1: Lambda Zap

Lambda zap [ICFP 06] is:
- a lambda calculus that exhibits intermittent data faults, plus operators to detect and correct them
- a type system that guarantees the observable outputs of well-typed programs do not change in the presence of a single fault; the types act as the "proofs" of fault tolerance
- expressive enough to implement an ordinary typed lambda calculus
The end result: the foundation for a fault-tolerant typed intermediate language.

The Fault Model

Lambda zap models simple data faults only:

  ( M, F[ v1 ] )  --->  ( M, F[ v2 ] )

Not modelled:
- memory faults (better protected using ECC hardware)
- control-flow faults (i.e., faults during control-flow transfer)
- instruction faults (i.e., faults in instruction opcodes)

Goal: construct programs that tolerate 1 fault, so that observers cannot distinguish between fault-free and 1-fault runs.

Lambda to Lambda Zap: The Main Idea

let x = 2 in              let x1 = 2 in
let y = x + x in   ==>    let x2 = 2 in
out y                     let x3 = 2 in
                          let y1 = x1 + x1 in
                          let y2 = x2 + x2 in
                          let y3 = x3 + x3 in
                          out [y1, y2, y3]

Instructions are replicated, and out takes an atomic majority vote before producing output. If a single fault corrupts one copy (say, x3 becomes 7), the corrupted value is copied and percolates through its copy of the computation, but the final output is unchanged.

Lambda to Lambda Zap: Control Flow

Subexpressions are translated recursively:

let x = 2 in                    let x1 = 2 in
if x then e1 else e2    ==>     let x2 = 2 in
                                let x3 = 2 in
                                if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]

A majority vote decides the control-flow transfer. (Function calls replicate arguments, results, and the function itself.) The interpreter sketch below makes the voting concrete.
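To make the voting concrete, here is a minimal OCaml interpreter sketch of the translated program's behavior, written for this text; the names exp, triple, vote, and zap are mine, not the calculus's.

  (* Values flow in three copies; out takes an atomic majority vote.
     This models the translation's result, not the translation itself. *)

  type exp =
    | Int of int
    | Add of exp * exp

  type triple = int * int * int

  (* Majority of three: a single fault corrupts at most one copy. *)
  let vote ((a, b, c) : triple) : int =
    if a = b || a = c then a else b

  (* Each copy is computed independently, so a corrupted copy
     percolates through its own computation only. *)
  let rec eval (e : exp) : triple =
    match e with
    | Int n -> (n, n, n)
    | Add (e1, e2) ->
        let (a1, b1, c1) = eval e1 in
        let (a2, b2, c2) = eval e2 in
        (a1 + a2, b1 + b2, c1 + c2)

  (* out [y1, y2, y3]: vote, then print. *)
  let out (t : triple) : unit =
    print_int (vote t); print_newline ()

  (* Model a single data fault: the third copy is zapped to 7. *)
  let zap ((a, b, _) : triple) : triple = (a, b, 7)

  let () =
    out (eval (Add (Int 2, Int 2)));         (* fault-free run: prints 4 *)
    out (zap (eval (Add (Int 2, Int 2))))    (* 1-fault run: still prints 4 *)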
Almost too easy --- can anything go wrong? ... Yes!

Optimization reduces replication overhead dramatically (e.g., by ~43% for 2 copies), but it can be unsound: the original implementation of SWIFT [Reis et al.] optimized away all of the redundancy, leaving an unreliable implementation!

Faulty Optimizations

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in                       let x1 = 2 in
let y1 = x1 + x1 in    CSE ==>      let y1 = x1 + x1 in
let y2 = x2 + x2 in                 out [y1, y1, y1]
let y3 = x3 + x3 in
out [y1, y2, y3]

In general, optimizations eliminate redundancy, while fault tolerance requires redundancy.

The Essential Problem

Bad code:                     Good code:

let x1 = 2 in                 let x1 = 2 in
let y1 = x1 + x1 in           let x2 = 2 in
out [y1, y1, y1]              let x3 = 2 in
                              let y1 = x1 + x1 in
                              let y2 = x2 + x2 in
                              let y3 = x3 + x3 in
                              out [y1, y2, y3]

In the bad code, the voters depend on a common value (x1). In the good code, the voters do not depend on any common value: red depends only on red, green only on green, blue only on blue.

A Type System for Lambda Zap

Key idea: types track the "color" of the underlying value and prevent interference between colors.

Colors  C ::= R | G | B
Types   T ::= C int | C bool | C (T1, T2, T3) -> (T1', T2', T3')

Sample Typing Rules

Judgement form: G |--z e : T    where z ::= C | .

Simple value typing rules:

  (x : T) in G
  --------------      ------------------      ----------------------
  G |--z x : T        G |--z C n : C int      G |--z C true : C bool

Sample expression typing rules:

  G |--z e1 : C int    G |--z e2 : C int
  --------------------------------------
  G |--z e1 + e2 : C int

  G |--z e1 : R int   G |--z e2 : G int   G |--z e3 : B int   G |--z e4 : T
  --------------------------------------------------------------------------
  G |--z out [e1, e2, e3]; e4 : T

  G |--z e1 : R bool   G |--z e2 : G bool   G |--z e3 : B bool
  G |--z e4 : T        G |--z e5 : T
  -------------------------------------------------------------
  G |--z if [e1, e2, e3] then e4 else e5 : T

Recall the "zap rule" from the operational semantics:

  ( M, F[ v1 ] )  --->  ( M, F[ v2 ] )

Before the fault, |-- v1 : C U with no side conditions. Afterwards, how can v2 be typed, so that we obtain type preservation? By the rule:

  ----------------------
  G |--C C v : C U

That is, "faulty typing" occurs within a single color only. A toy checker for the color discipline is sketched below.
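For illustration, here is a toy OCaml checker enforcing the color discipline on a stripped-down expression language. It is my sketch, not the lambda zap type system itself; Num, Out, and color_of are invented names.

  (* Colors prevent the three voters from depending on a common value. *)

  type color = R | G | B

  type exp =
    | Var of string
    | Num of color * int                 (* C n : C int *)
    | Add of exp * exp
    | Out of exp * exp * exp             (* out [e1, e2, e3] *)

  exception Type_error of string

  let rec color_of (env : (string * color) list) (e : exp) : color =
    match e with
    | Var x -> List.assoc x env
    | Num (c, _) -> c
    | Add (e1, e2) ->
        (* e1 + e2 : C int only when both operands share color C *)
        let c1 = color_of env e1 in
        let c2 = color_of env e2 in
        if c1 = c2 then c1 else raise (Type_error "+ mixes colors")
    | Out (e1, e2, e3) ->
        (* voters must carry one copy per color: R, G, B *)
        (match color_of env e1, color_of env e2, color_of env e3 with
         | R, G, B -> R   (* simplified: the continuation e4 is omitted *)
         | _ -> raise (Type_error "out needs one R, one G and one B copy"))

  let () =
    let env = [ ("y1", R); ("y2", G); ("y3", B) ] in
    (* the good code typechecks ... *)
    ignore (color_of env (Out (Var "y1", Var "y2", Var "y3")));
    (* ... and the CSE'd code out [y1, y1, y1] is rejected *)
    (try ignore (color_of env (Out (Var "y1", Var "y1", Var "y1")))
     with Type_error msg -> print_endline ("rejected: " ^ msg))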
Theorems

Theorem 1: Well-typed programs are safe, even when there is a single error.
Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat].
Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap [that satisfies the caveat].

The Caveat

Goal: 0-fault and 1-fault executions should be indistinguishable. But consider this bad, yet well-typed, code:

  out [2, 3, 3]    outputs 3 after no faults
  out [2, 2, 3]    (after 1 fault) outputs 2

More importantly, out [2, 3, 3] is obviously a symptom of a compiler bug, and out [2, 3, 4] is even worse: good runs never come to consensus.

Solution: the replicated computations must be independent, but equivalent.

Modified typing:

  G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U    G |--z e4 : T
  G |--z e1 ~~ e2    G |--z e2 ~~ e3
  ----------------------------------------------------------------------
  G |-- out [e1, e2, e3]; e4 : T

More generally, programmers may form "triples" of equivalent values.

Introduction form: [e1, e2, e3]
- a collection of 3 items
- each of the 3 is stored in a separate register
- a single fault affects at most one

  G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U
  G |--z e1 ~~ e2    G |--z e2 ~~ e3
  -----------------------------------------------------
  G |-- [e1, e2, e3] : [R U, G U, B U]

Elimination form: let [x1, x2, x3] = e1 in e2

  G |--z e1 : [R U, G U, B U]
  G, x1 : R U, x2 : G U, x3 : B U, x1 ~~ x2, x2 ~~ x3 |--z e2 : T
  ---------------------------------------------------------------
  G |-- let [x1, x2, x3] = e1 in e2 : T

Theorems*

Theorem 1: Well-typed programs are safe, even when there is a single error.
Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.
Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap.

* There is still one "i" to be dotted in the proofs of these theorems. Lester Mackey, a brilliant Princeton undergrad, has proven all of the key theorems modulo that dotted "i".

Step 2: Fault-Tolerant Typed Assembly Language (TAL/FT)

Lambda zap is a playground for studying the principles of fault tolerance in an idealized setting. TAL/FT is a more realistic, assembly-level, hybrid HW/SW fault-tolerance scheme with:
(1) a formal fault model
(2) a formal definition of fault tolerance relative to memory-mapped I/O
(3) a sound type system for proving that compiled programs are fault tolerant

The compiler architecture is as before, with TAL/FT types playing the role of the reliability proof: the reliability transform emits typed TAL/FT, optimization preserves the types, and the type checker serves as the proof checker.

TAL/FT: Key Ideas (Fault Model)

- registers may incur arbitrary faults in between the execution of any two instructions
- memory (including code) is protected by ECC
- the fault model is formalized as part of the hardware operational semantics

TAL/FT: Key Ideas (Properties)

Setting: the processor reads ECC-protected memory and issues stores to a memory-mapped I/O device.

Primary goal: if there is one fault, then either
(1) the memory-mapped I/O device sees exactly the same sequence of stores as in a fault-free execution, or
(2) the hardware detects and signals a fault, and the memory-mapped I/O device sees a prefix of the stores from a fault-free execution.
Secondary goal: no false positives.
This property can be read as a simple check on store traces, as sketched below.
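One way to read the primary goal is as a predicate on store traces. The OCaml sketch below uses my own encoding (event lists); TAL/FT itself states the property over its hardware operational semantics, and the secondary goal (no false positives) is not captured here.

  type event = Store of int | Fault_signal

  (* A 1-fault run is acceptable iff it matches the fault-free run
     exactly, or it is a prefix of it ending in a hardware fault signal. *)
  let rec acceptable (fault_free : event list) (faulty : event list) : bool =
    match fault_free, faulty with
    | _, [ Fault_signal ] -> true               (* detected: signaled prefix *)
    | x :: xs, y :: ys -> x = y && acceptable xs ys
    | [], [] -> true                            (* identical store sequences *)
    | _, _ -> false                             (* silent divergence: bad *)

  let () =
    let good = [ Store 1; Store 2; Store 3 ] in
    assert (acceptable good [ Store 1; Store 2; Store 3 ]);  (* same stores *)
    assert (acceptable good [ Store 1; Fault_signal ]);      (* signaled prefix *)
    assert (not (acceptable good [ Store 1; Store 9; Store 3 ]))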
TAL/FT: Key Ideas (Mechanisms)

Compiler strategy:
- create two redundant computations, as in lambda zap

Hardware support:
- two copies ==> fault detection; fault recovery is handled by a higher-level process
- special instructions and a modified store buffer for implementing reliable stores
- special instructions for reliable control-flow transfers

Type system:
- simple typing mechanisms based on the original TAL [Morrisett, Walker, et al.]
- redundant values with separate colors, as in lambda zap
- the value identities needed for equivalence checking are tracked using singleton types, combined with some ideas drawn from traditional Hoare logics

Current & Future Work

- Build the first compiler that can automatically generate reliability proofs for compiled programs
- Study alternative fault detection schemes:
  - TAL/FT refinement and implementation
  - type- and reliability-preserving optimizations
  - fault detection & recovery on current hardware
  - exploiting multi-core alternatives
- Understand the foundational theoretical principles that allow programs to tolerate transient faults:
  - general-purpose program logics for reasoning about faulty programs

Other Research I Do

PADS [popl 06, sigmod 06 demo, popl 07]: automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for "ad hoc" data formats; with Kathleen Fisher (AT&T).

Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]: semantics, design, and implementation of programs that monitor other programs for security (or other purposes).

TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ...]: theory, design, and implementation of type systems for compiler target and intermediate languages.

Conclusions

Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10+ years out). Using proofs and types, I think we are going to be able to develop highly reliable software that runs on unreliable hardware.

end!

Extra slides:

Lambda to Lambda Zap: Control Flow (Functions)

The function operational semantics follows. Functions replicate as triples, and calls vote on the transfer target:

let f = \x.e in    ==>    let [f1, f2, f3] = \x. [[ e ]] in
f 2                       [f1, f2, f3] [2, 2, 2]

Operational semantics:

  (M; let [f1, f2, f3] = \x.e1 in e2)
    --->  (M, l = \x.e1; e2[ l / f1 ][ l / f2 ][ l / f3 ])

A majority vote is taken on the control-flow transfer; the sketch below shows the voting on the code pointer.
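As a rough executable reading of this rule, the OCaml sketch below uses closures in place of the code label l and physical equality in place of label equality; vote3 is an invented name.

  (* Physical equality stands in for label equality on code pointers:
     all three copies of f should be the same label l. *)
  let vote3 f1 f2 f3 =
    if f1 == f2 || f1 == f3 then f1 else f2

  let () =
    (* let [f1, f2, f3] = \x.e in [f1, f2, f3] [2, 2, 2] *)
    let l (x1, x2, x3) = (x1 + x1, x2 + x2, x3 + x3) in
    let f1 = l and f2 = l and f3 = l in
    let target = vote3 f1 f2 f3 in          (* majority vote on the transfer *)
    let (y1, y2, y3) = target (2, 2, 2) in
    Printf.printf "%d %d %d\n" y1 y2 y3     (* prints: 4 4 4 *)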
TAL/FT Hardware

- replicated program counters
- a store queue
- ECC-protected caches/memory

Related Work: Software Mitigation Techniques

Examples: N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc.
Hybrid hardware-software techniques: watchdog processors, CRAFT [Reis et al. 2005], etc.

Pros:
- immediate deployment: if your system is suffering soft-error-related failures, you may deploy new software immediately (this would have benefited Los Alamos Labs, among others)
- policies may be customized to the environment and the application
- reduced hardware cost
Cons:
- for the same universal policy, slower (but not as much as you'd think)
- IT MIGHT NOT ACTUALLY WORK!

Mitigation Techniques

Hardware (error-correcting codes; redundant hardware):
- pros: fast for a fixed policy
- cons: the FT policy is decided at hardware design time; mistakes cost millions; one-size-fits-all policy; expensive

Software and hybrid schemes (replicate computations):
- pros: immediate deployment; policies customized to the environment and application; reduced hardware cost
- cons: for the same universal policy, slower (but not as much as you'd think)
- cons: it may not actually work! Much research in the hardware and compilers communities is completely lacking proof.

Solutions in Hardware

Replication of instructions and checking implemented in special-purpose hardware; e.g., Reinhardt & Mukherjee 2000.
- pros: transparent to software
- cons: one-size-fits-all reliability policy
- cons: can't fix an existing problem, and specialized hardware serves a reduced market

Solutions in Software (or Hybrid Hardware/Software)

The compiler generates replicated instructions and checking code; e.g., Reis et al. 05.
- pros: flexibility: new reliability policies may be deployed whenever needed
- cons: for a fixed reliability policy, slower than specialized hardware solutions
- cons: it might not actually work