Dependable Computing Systems

Jim Gray, Microsoft, Gray@Microsoft.com
Andreas Reuter, International University, Andreas.Reuter@i-u.de

         Mon        Tue           Wed          Thur             Fri
  9:00   Overview   TP mons       Log          Files & Buffers  B-tree
 11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
  1:30   Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
  3:30   T Models   Queues        Adv TM       Replication      Benchmark
  7:00   Party      Workflow      Cyberbrick   Party

Gray & Reuter FT 2: 1

The Airplane Rule
  A two-engine airplane has twice as many engine problems.
  A thousand-engine airplane has thousands of engine problems.
  Internet: node fails every 2 weeks.
  Vendors: disk fails every 40 years.
  Here: node "fails" every 20 minutes, disk fails every 2 weeks.
  Fault Tolerance is KEY! Mask and repair faults.
  [Figure: High Speed Network (10 Gb/s); 100 Nodes, 1 Tips; 1,000 discs = 10 Terrorbytes; 100 Tape Transports = 1,000 tapes = 1 PetaByte]

Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary

DEPENDABILITY: The 3 ITIES
• Reliability / Integrity: Does the right thing (also large MTTF)
• Availability: Does it now (also large MTTF / (MTTF + MTTR))
  [Figure: Security, Integrity, Reliability, Availability]
  System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time).
• Holistic vs Reductionist view

High Availability System Classes
Goal: Build Class 6 Systems

  System Type              Unavailable (min/year)   Availability   Availability Class
  Unmanaged                50,000                   90.%           1
  Managed                  5,000                    99.%           2
  Well Managed             500                      99.9%          3
  Fault Tolerant           50                       99.99%         4
  High-Availability        5                        99.999%        5
  Very-High-Availability   .5                       99.9999%       6
  Ultra-Availability       .05                      99.99999%      7

Sources of Failures

                        MTTF         MTTR
  Power Failure:        2000 hr      1 hr
  Phone Lines
    Soft                >.1 hr       .1 hr
    Hard                4000 hr      10 hr
  Hardware Modules:     100,000 hr   10 hr (many are transient)
  Software:
    1 Bug/1000 Lines Of Code (after vendor-user testing)
    => Thousands of bugs in System!
    Most software failures are transient: dump & restart system.

  Useful fact: 8,760 hrs/year ~ 10k hr/year

Case Studies - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).

  Cause                             Share of outages   MTTF
  Vendor (hardware and software)    42%                5 months
  Application software              25%                9 months
  Communications lines              12%                1.5 years
  Environment                       11.2%              2 years
  Operations                        9.3%               2 years
  All causes                                           10 weeks

  1,383 institutions reported (6/84 - 7/85)
  7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
  To get 10 year MTTF must attack all these problems

Case Studies - Tandem Outage Reports to Vendor
Summary Tandem EWR Data

                     1985      1987      1989
  Customers          1000      1300      2000
  EWR Customers      ?         ?         267
  Outage Customers   176       205       164
  Systems            2400      6000      9000
  Processors         7,000     15,000    25,500
  Discs              16,000    46,000    74,000
  Cases              305       227       501
  Reports            491       535       766
  Faults             592       609       892
  Outages            285       294       438
  System MTTF        8 years   20 years  21 years

  Totals:
    More than 7,000 Customer years
    More than 30,000 System years
    More than 80,000 Processor years
    More than 200,000 Disc years

  Systematic Under-reporting
  But ratios & trends interesting

Case Studies - Tandem Trends
  MTTF improved: WOW! Outages per millennium.
  Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
  NOTE: Systematic under-reporting of Environment, Operations errors, Application Software

Case Studies - Tandem Trends
Reported MTTF by Component
  [Figure: Mean Time to System Failure (years) by Cause, 1985 to 1989; series: maintenance, hardware, environment, operations, software, total]

  MTTF (years)   1985   1987   1990
  SOFTWARE       2      53     33
  HARDWARE       29     91     310
  MAINTENANCE    45     162    409
  OPERATIONS     99     171    136
  ENVIRONMENT    142    214    346
  SYSTEM         8      20     21

  Remember Systematic Under-reporting

Summary
  Current Situation: ~4-year MTTF => Fault Tolerance Works.
  Hardware is GREAT (maintenance and MTTF).
  Software masks most hardware faults.
  Many hidden software outages in operations:
    New System Software.
    New Application Software.
    Utilities.
  Must make all software ONLINE.
  Software seems to define a 30-year MTTF ceiling.
  Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary

Key Idea
  { Architecture, Software } Masks { Hardware Faults, Environmental Faults, Distribution, Maintenance }
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
  Software-fault tolerance is the key to dependability.
  INVENT IT!

Fault Tolerance Techniques
  FAIL FAST MODULES: work or stop
  SPARE MODULES: instant repair time.
  INDEPENDENT MODULE FAILS by design
    MTTF_pair ~ MTTF^2 / MTTR (so want tiny MTTR)
  MESSAGE BASED OS: Fault Isolation
    software has no shared memory.
  SESSION-ORIENTED COMM: Reliable messages
    detect lost/duplicate messages
    coordinate messages with commit
  PROCESS PAIRS: Mask Hardware & Software Faults
  TRANSACTIONS: give A.C.I.D.
    (simple fault model)

Example: the FT Bank
  [Figure: Fault Tolerant Computer, Backup System]
  System MTTF > 10 years (except for power & terminals)
  Modularity & Repair are KEY:
    von Neumann needed 20,000x redundancy in wires and switches.
    We use 2x redundancy.
  Redundant hardware can support peak loads (so not redundant)

Fail-Fast is Good, Repair is Needed
  [Figure: Fault lifecycle of a module: Fault, Detect, Repair]
  Fail-fast gives short fault latency.
  High Availability is low UN-Availability:
    Unavailability ~ MTTR / MTTF
  Improving either MTTR or MTTF gives benefit.
  Simple redundancy does not help much.

Hardware Reliability/Availability (how to make it fail fast)
  [Figure: Basic FailFast Designs (Pair with Comparator, Triplex); Recursive Designs; Recursive Availability Designs (Triple Modular Redundancy, Pair & Spare)]
  Comparator Strategies:
    Duplex:
      Fail-Fast: fail if either fails (e.g. duplexed cpus)
      vs Fail-Soft: fail if both fail (e.g. disc, atm,...)
      Note: in recursive pairs, parent knows which is bad.
    Triplex:
      Fail-Fast: fail if 2 fail (triplexed cpus)
      Fail-Soft: fail if 3 fail (triplexed FailFast cpus)

Redundant Designs Have Worse MTTF!
  [Figure: state diagrams; with n modules working, the next failure arrives after mttf/n]

  Design                      Time to system failure               MTTF
  Duplex: fail fast           mttf/2                               mttf/2
  Duplex: fail soft           mttf/2 + mttf/1                      1.5*mttf
  TMR: fail fast              mttf/3 + mttf/2                      5/6*mttf
  TMR: fail soft              mttf/3 + mttf/2 + mttf/1             11/6*mttf
  Pair & Spare: fail fast     mttf/4 + mttf/2                      3/4*mttf
  Pair & Spare: fail soft     mttf/4 + mttf/3 + mttf/2 + mttf/1    ~2.1*mttf

  THIS IS NOT GOOD: Variance is lower but MTTF is worse.
  Simple redundancy does not improve MTTF (sometimes hurts).
  This is just an example of the airplane rule.
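The no-repair figures on the "Redundant Designs Have Worse MTTF!" slide can be checked with a few lines of arithmetic. The sketch below assumes independent modules with memoryless (exponential) failures, so that with n modules running the next failure arrives after mttf/n on average. The function name `mttf_without_repair` and the `DESIGNS` table are illustrative names, not part of the original slides.

```python
from fractions import Fraction


def mttf_without_repair(working_counts):
    """Expected time to system failure, in units of one module's mttf.

    working_counts lists how many modules are running during each stage
    the system survives. With n independent modules running, the next
    failure arrives after mttf/n on average, so the stages simply add.
    """
    return sum(Fraction(1, n) for n in working_counts)


# Stages survived by each design before the system as a whole stops.
# Fail-fast designs stop early: a duplex stops at the first failure,
# and pair & spare loses a whole pair (2 modules) at the first failure.
DESIGNS = {
    "simplex": [1],
    "duplex fail-fast": [2],
    "duplex fail-soft": [2, 1],
    "TMR fail-fast": [3, 2],
    "TMR fail-soft": [3, 2, 1],
    "pair & spare fail-fast": [4, 2],
    "pair & spare fail-soft": [4, 3, 2, 1],
}

RESULTS = {name: mttf_without_repair(stages) for name, stages in DESIGNS.items()}

if __name__ == "__main__":
    for name, value in RESULTS.items():
        print(f"{name:24s} {str(value):>6s} = {float(value):.2f} x mttf")
```

Exact fractions are used so the results can be compared directly with the slide's 5/6, 11/6, and 3/4; the pair & spare fail-soft value comes out as 25/12, which the slide rounds to ~2.1.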
Add Repair: Get 10^4 Improvement
  [Figure: state diagrams with repair (mttr arcs): Duplex fail fast (mttf/2); Duplex fail soft (10^4 mttf); TMR fail fast (10^4 mttf); TMR fail soft (10^5 mttf)]

  Availability estimates: 1 year MTTF modules, 12-hour MTTR

  Design                        MTTF           EQUATION             COST
  SIMPLEX                       1 year         MTTF                 1
  DUPLEX: FAIL FAST             ~0.5 years     MTTF/2               2+
  DUPLEX: FAIL SOFT             ~1.5 years     MTTF(3/2)            2+
  TRIPLEX: FAIL FAST            .8 year        MTTF(5/6)            3+
  TRIPLEX: FAIL SOFT            1.8 year       1.8 MTTF             3+
  Pair and spare: FAIL-FAST     ~.7 year       MTTF(3/4)            4+
  TRIPLEX WITH REPAIR           >10^5 years    MTTF^3 / 3 MTTR^2    3+
  Duplex fail soft + REPAIR     >10^4 years    MTTF^2 / 2 MTTR      4+

When To Repair?
  Chances Of Tolerating A Fault are 1000:1 (class 3)
  A 1995 study: Processor & Disc Rated At ~ 10khr MTTF

  Computed Single Failures   Observed Double Fails   Ratio
  10k Processor Fails        14 Double               ~ 1000 : 1
  40k Disc Fails             26 Double               ~ 1000 : 1

  Hardware Maintenance:
    On-Line Maintenance "Works" 999 Times Out Of 1000.
    The chance a duplexed disc will fail during maintenance ~ 1:1000
    Risk Is 30x Higher During Maintenance => Do It Off Peak Hour
  Software Maintenance:
    Repair Only Virulent Bugs
    Wait For Next Release To Fix Benign Bugs

OK: So Far
  Hardware fail-fast is easy.
  Redundancy plus Repair is great (Class 7 availability).
  Hardware redundancy & repair is via modules.
  How can we get instant software repair?
  We Know How To Get Reliable Storage:
    RAID Or Dumps And Transaction Logs.
  We Know How To Get Available Storage:
    Fail Soft Duplexed Discs (RAID 1...N).
  ? HOW DO WE GET RELIABLE EXECUTION?
  ? HOW DO WE GET AVAILABLE EXECUTION?

Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary

Software Techniques: Learning from Hardware
  Most outages in Fault Tolerant Systems are SOFTWARE
  Fault Avoidance Techniques: Good & Correct design.
  After that: Software Fault Tolerance Techniques:
    Modularity (isolation, fault containment)
    Design diversity
      N-Version Programming: N different implementations
    Defensive Programming: Check parameters and data
    Auditors: Check data structures in background
    Transactions: to clean up state after a failure
  Paradox: Need Fail-Fast Software

Fail-Fast and High-Availability Execution
  Software N-Plexing: Design Diversity
    N-Version Programming
      Write the same program N times (N > 3)
      Compare outputs of all programs and take majority vote
  Process Pairs: Instant restart (repair)
    Use defensive programming to make a process fail-fast
    Have restarted process ready in separate environment
    Second process "takes over" if primary faults
    Transaction mechanism can clean up distributed state if takeover in middle of computation.
  [Figure: LOGICAL PROCESS = PROCESS PAIR; SESSION; PRIMARY PROCESS; STATE INFORMATION; BACKUP PROCESS]

What Is MTTF of N-Version Program?
  First fails after MTTF/N
  Second fails after MTTF/(N-1), ...
  so MTTF(1/N + 1/(N-1) + ... + 1/2)
  harmonic series goes to infinity, but VERY slowly
  for example 100-version programming gives ~4 MTTF of 1-version programming
  Reduces variance
  N-Version Programming Needs REPAIR
    If a program fails, must reset its state from other programs.
    => programs have common data/state representation.
    How does this work for Database Systems? Operating Systems? Network Systems?
    Answer: I don't know.
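The "~4 MTTF" claim for 100-version programming can be reproduced directly from the slide's sum. The sketch below takes the formula exactly as stated, MTTF(1/N + 1/(N-1) + ... + 1/2), and assumes independent, memoryless failures with no repair. The function name `n_version_mttf` is hypothetical.

```python
def n_version_mttf(n: int) -> float:
    """MTTF of an n-version program without repair.

    The result is in units of a single version's MTTF and follows the
    slide's sum 1/n + 1/(n-1) + ... + 1/2: each term is the expected wait
    for the next failure while k versions are still running.
    """
    if n < 2:
        raise ValueError("the slide's sum needs at least 2 versions")
    return sum(1.0 / k for k in range(2, n + 1))


if __name__ == "__main__":
    for versions in (3, 10, 100, 1000):
        print(f"{versions:5d} versions -> {n_version_mttf(versions):.2f} x MTTF")
```

For N = 3 the sum is 5/6, the same value as the TMR fail-fast design, and growing N from 100 to 1000 adds only about 2.3 MTTF, which is the "VERY slowly" of the harmonic series.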
Why Process Pairs Mask Faults
  Many Software Faults are Soft
  After:
    Design Review
    Code Inspection
    Alpha Test
    Beta Test
    10k Hrs Of Gamma Test (Production)
  Most Software Faults Are Transient:
    MVS Functional Recovery Routines   5:1
    Tandem Spooler                     100:1
    Adams                              >100:1
  Terminology:
    Heisenbug: Works On Retry
    Bohrbug: Faults Again On Retry
  Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984
  Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
  Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.

Process Pair Repair Strategy
  If software fault (bug) is a Bohrbug, then there is no repair:
    "wait for the next release" or
    "get an emergency bug fix" or
    "get a new vendor"
  If software fault is a Heisenbug, then repair is
    reboot and retry or
    switch to backup process (instant restart)
  [Figure: LOGICAL PROCESS = PROCESS PAIR; SESSION; PRIMARY PROCESS; STATE INFORMATION; BACKUP PROCESS]
  PROCESS PAIRS Tolerate:
    Hardware Faults
    Heisenbugs
  Repair time is seconds, could be milliseconds if time is critical.
  Flavors Of Process Pair:
    Lockstep
    Automatic
    State Checkpointing
    Delta Checkpointing
    Persistent

How Takeover Masks Failures
  Server Resets At Takeover. But What About
    Application State?
    Database State?
    Network State?
  [Figure: LOGICAL PROCESS = PROCESS PAIR; SESSION; PRIMARY PROCESS; STATE INFORMATION; BACKUP PROCESS]
  Answer: Use Transactions To Reset State!
    Abort Transaction If Process Fails.
    Keeps Network "Up"
    Keeps System "Up"
    Reprocesses Some Transactions On Failure

PROCESS PAIRS - SUMMARY
  Transactions Give Reliability
  Process Pairs Give Availability
  Process Pairs Are Expensive & Hard To Program
  Transactions + Persistent Process Pairs => Fault Tolerant Sessions & Execution
  When Tandem Converted To This Style:
    Saved 3x Messages
    Saved 5x Message Bytes
    Made Programming Easier

SYSTEM PAIRS FOR HIGH AVAILABILITY
  Programs, Data, Processes Replicated at two sites.
  Pair looks like a single system.
    System becomes logical concept
  Like Process Pairs: System Pairs.
  Backup receives transaction log (spooled if backup down).
  If primary fails or operator switches, backup offers service.

SYSTEM PAIR CONFIGURATION OPTIONS
  Mutual Backup: each has 1/2 of Database & Application
  Hub: One site acts as backup for many others
  In general can be any directed graph
  [Figure: example configurations of Primary and Backup sites]
  Stale replicas: Lazy replication
  [Figure: Copy sites]

SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
  Step 1: Both systems are running V1.      (Primary) V1   (Backup) V1
  Step 2: Backup is cold-loaded as V2.      (Primary) V1   (Backup) V2
  Step 3: SWITCH to Backup.                 (Backup) V1    (Primary) V2
  Step 4: Backup is cold-loaded as V2 D30.  (Backup) V2    (Primary) V2
  Similar ideas apply to:
    Database Reorganization
    Hardware modification (e.g. add discs, processors,...)
    Hardware maintenance
    Environmental changes (rewire, new air conditioning)
    Move primary or backup to new location.

SYSTEM PAIR BENEFITS
  Protects against ENVIRONMENT: different sites
    weather
    utilities
    sabotage
  Protects against OPERATOR FAILURE: two sites, two sets of operators
  Protects against MAINTENANCE OUTAGES
    work on backup
    software/hardware install/upgrade/move...
  Protects against HARDWARE FAILURES
    backup takes over
  Protects against TRANSIENT SOFTWARE ERRORS
  Commercial systems:
    Digital's Remote Transaction Router (RTR)
    Tandem's Remote Database Facility (RDF)
    IBM's Cross Recovery XRF (both in same campus)
    Oracle, Sybase, Informix, Microsoft...
    replication

SUMMARY
  FT systems fail for the conventional reasons:
    Environment mostly
    People sometimes
    Software mostly
    Hardware rarely
  MTTF of FT SYSTEMS ~ 50X conventional
    ~ years vs weeks
  Fail-Fast Modules + Reconfiguration + Repair => Good Hardware Fault Tolerance
  Transactions + Process Pairs => Good Software Fault Tolerance (Repair)
  System Pairs Hide Many Faults
  Challenge: Tolerate Human Errors (make system simpler to manage, operate, and maintain)

Key Idea
  { Architecture, Software } Masks { Hardware Faults, Environmental Faults, Distribution, Maintenance }
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
  Software-fault tolerance is the key to dependability.
  INVENT IT!

References
  Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development. 28(1): 2-14.
  Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
  Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
  Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
  Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability. 39(4): 409-418.
  Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
  Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
  Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
  Long, D. D., J. L. Carroll, and C. J. Park (1991). A study of the reliability of Internet sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.