FT 101
Jim Gray, Microsoft Research
http://research.microsoft.com/~gray/Talks/
(80% of the slides are hidden, so view with PPT to see them all.)

Outline
• Terminology and empirical measures
• General methods to mask faults
• Software fault tolerance
• Summary

Dependability: The 3 ITIES (Reliability, Availability, Integrity/Security)
• Reliability / Integrity: does the right thing (equivalently, large MTTF).
• Availability: does it now (equivalently, small MTTR):
    Availability = MTTF / (MTTF + MTTR)
• System availability: if 90% of terminals are up and 99% of the DB is up, then ~89% of transactions are serviced on time (0.90 x 0.99 ≈ 0.89).
• Holistic vs. reductionist view.

High Availability System Classes
Goal: Build Class 6 Systems

  System type               Unavailable (min/year)   Availability   Class
  Unmanaged                 50,000                   90.%           1
  Managed                    5,000                   99.%           2
  Well managed                 500                   99.9%          3
  Fault tolerant                50                   99.99%         4
  High availability              5                   99.999%        5
  Very high availability         0.5                 99.9999%       6
  Ultra availability             0.05                99.99999%      7

UnAvailability = MTTR/MTBF: you can cut it in half by halving MTTR or doubling MTBF.

Demo: Looking at Some Nodes
• Look at http://uptime.netcraft.com/
• Internet node availability: 92% mean, 97% median.
• Darrell Long (UCSC), ftp://ftp.cse.ucsc.edu/pub/tr/:
  – ucsc-crl-90-46.ps.Z "A Study of the Reliability of Internet Sites"
  – ucsc-crl-91-06.ps.Z "Estimating the Reliability of Hosts Using the Internet"
  – ucsc-crl-93-40.ps.Z "A Study of the Reliability of Hosts on the Internet"
  – ucsc-crl-95-16.ps.Z "A Longitudinal Survey of Internet Host Reliability"

Sources of Failures

  Source              MTTF          MTTR
  Power failure       2,000 hr      1 hr
  Phone lines: soft   >0.1 hr       0.1 hr
  Phone lines: hard   4,000 hr      10 hr
  Hardware modules    100,000 hr    10 hr    (many faults are transient)

• Software: 1 bug per 1,000 lines of code (after vendor-user testing), so there are thousands of bugs in a system!
• Most software failures are transient: dump & restart the system.
• Useful fact: 8,760 hr/year ≈ 10k hr/year.

Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans.: Eiichi Watanabe).
1,383 institutions reported (6/84 - 7/85): 7,517 outages; MTTF ≈ 10 weeks; average outage duration ≈ 90 MINUTES.

  Cause                          Share of outages   MTTF
  Vendor (hardware and software) 42%                5 months
  Application software           25%                9 months
  Communications lines           12%                1.5 years
  Environment                    11.2%              2 years
  Operations                     9.3%               2 years

To get a 10-year MTTF, one must attack ALL of these areas.

Case Studies - Tandem Trends
Reported mean time to system failure (years) by cause (chart omitted; the data):

  Cause          1985   1987   1990
  Software          2     53     33
  Hardware         29     91    310
  Maintenance      45    162    409
  Operations       99    171    136
  Environment     142    214    346
  SYSTEM            8     20     21

Problem: systematic under-reporting.

Many Software Faults are Soft
• After design review, code inspection, alpha test, beta test, and 10k hours of gamma test (production), most remaining software faults are transient. Transient-to-hard ratios:

  MVS functional recovery routines   5:1
  Tandem spooler                     100:1
  Adams                              >100:1

• Terminology: a Heisenbug works on retry; a Bohrbug faults again on retry.
• Adams: "Optimizing Preventative Service of Software Products", IBM J. R&D 28(1), 1984.
  Gray: "Why Do Computers Stop", Tandem TR 85.7, 1985.
  Mourad: "The Reliability of the IBM/XA Operating System", 15th FTCS, 1985.
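Not part of the original deck: a quick numeric restatement, in Python, of the availability arithmetic above (Availability = MTTF/(MTTF+MTTR); a class-k system is unavailable about 10^-k of the time). The workload numbers are taken from the slides.

```python
# A minimal sketch of the availability-class arithmetic from the slides above.
import math

def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

def availability_class(avail: float) -> int:
    # Class k systems are unavailable about 10**-k of the time.
    return int(math.floor(-math.log10(1.0 - avail)))

HOURS_PER_YEAR = 8760  # the "useful fact" above

# A "well managed" system: down ~500 minutes/year.
down_hours = 500 / 60
a = 1 - down_hours / HOURS_PER_YEAR
print(f"availability={a:.5f}, class={availability_class(a)}")  # ~99.9%, class 3

# Composite availability: 90% of terminals up and 99% of the DB up.
print(f"system availability = {0.90 * 0.99:.3f}")              # ~0.89
```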
Summary of FT Studies
• Current situation: ~4-year MTTF, so fault tolerance works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
  – new software
  – utilities
• Must make all software ONLINE.
• Software seems to define a 30-year MTTF ceiling.
• Reasonable goal: 100-year MTTF (class 4 today, class 6 tomorrow).

Fault Tolerance vs. Disaster Tolerance
• Fault tolerance: mask local faults.
  – RAID disks
  – uninterruptible power supplies
  – cluster failover
• Disaster tolerance: mask site failures.
  – Protects against fire, flood, sabotage, ...
  – Redundant system and service at a remote site.
  – Uses design diversity.

Outline
• Terminology and empirical measures
• General methods to mask faults
• Software fault tolerance
• Summary

Fault Model
• Failures are independent, so single-fault tolerance is a big win.
• Hardware fails fast (blue-screen).
• Software fails fast (or goes to sleep).
• Software is often repaired by reboot: Heisenbugs.
• Operations tasks are a major source of outage:
  – utility operations
  – software upgrades

Fault Tolerance Techniques
• Fail-fast modules: they work or they stop.
• Spare modules: give instant repair time.
• Independent module failures by design: MTTF_pair ≈ MTTF²/MTTR (so we want a tiny MTTR).
• Message-based OS: fault isolation; software has no shared memory.
• Session-oriented communication: reliable messages detect lost/duplicate messages and coordinate messages with commit.
• Process pairs: mask hardware & software faults.
• Transactions: give A.C.I.D. (a simple fault model).

Example: The FT Bank
(Diagram: a fault-tolerant computer plus a backup system.)
• System MTTF > 10 years (except for power & terminals).
• Modularity & repair are KEY: von Neumann needed 20,000x redundancy in wires and switches; we use 2x redundancy.
• Redundant hardware can support peak loads (so it is not purely redundant).

Fail-Fast is Good, Repair is Needed
• Over a module's lifecycle, fail-fast gives short fault latency.
• High availability is low UN-availability: Unavailability ≈ MTTR / MTTF.
• Improving either MTTR or MTTF gives benefit; simple redundancy does not help much.

Hardware Reliability/Availability (how to make it fail fast)
• Basic fail-fast designs: pair (two modules plus a comparator), triplex (triple modular redundancy).
• Recursive availability designs: e.g., pair & spare.
• Duplex strategies:
  – fail-fast: fail if either module fails (e.g., duplexed CPUs);
  – fail-soft: fail only if both fail (e.g., disc, ATM, ...);
  – note: in recursive pairs, the parent knows which child is bad.
• Triplex strategies:
  – fail-fast: fail if 2 of 3 fail (triplexed CPUs);
  – fail-soft: fail only if all 3 fail (triplexed fail-fast CPUs).

Redundant Designs Have Worse MTTF!
(Markov state diagrams omitted: k modules working go to k-1 working at rate k/MTTF. The results:)

  Duplex fail-fast         MTTF/2        Duplex fail-soft        (3/2)·MTTF
  TMR fail-fast            (5/6)·MTTF    TMR fail-soft           (11/6)·MTTF
  Pair & spare fail-fast   (3/4)·MTTF    Pair & spare fail-soft  ~2.1·MTTF

• The Airplane Rule: a two-engine airplane has twice as many engine problems as a one-engine plane.
• THIS IS NOT GOOD: the variance is lower, but the MTTF is worse.
• Simple redundancy does not improve MTTF (it sometimes hurts); this is just the airplane rule.
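Not part of the original deck: a Monte-Carlo sanity check of the MTTF figures above, assuming exponentially distributed, independent module lifetimes with MTTF = 1.0.

```python
# Simulate the fail-fast / fail-soft designs above and compare the observed
# system MTTF against the closed-form values on the slide.
import random

N = 200_000
duplex_ff = duplex_fs = tmr_ff = tmr_fs = 0.0
for _ in range(N):
    a, b, c = (random.expovariate(1.0) for _ in range(3))
    duplex_ff += min(a, b)      # fail-fast pair: dies at the first failure
    duplex_fs += max(a, b)      # fail-soft pair: dies at the second failure
    s = sorted((a, b, c))
    tmr_ff += s[1]              # TMR fail-fast: dies when 2 of 3 are dead
    tmr_fs += s[2]              # TMR fail-soft: dies when all 3 are dead

print(f"duplex fail-fast ~ {duplex_ff/N:.2f} (slide: 1/2)")
print(f"TMR    fail-fast ~ {tmr_ff/N:.2f} (slide: 5/6)")
print(f"duplex fail-soft ~ {duplex_fs/N:.2f} (slide: 3/2)")
print(f"TMR    fail-soft ~ {tmr_fs/N:.2f} (slide: 11/6)")
```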
Add Repair: Get 10^4 Improvement
(Markov diagrams omitted: each design adds repair transitions, at rate 1/MTTR, back toward the all-working state.)

Availability estimates, assuming modules with 1-year MTTF and 12-hour MTTR:

  Design                          System MTTF    Equation            Cost
  Simplex                         1 year         MTTF                1
  Duplex: fail-fast               ~0.5 year      MTTF/2              2+
  Duplex: fail-soft               ~1.5 years     MTTF·(3/2)          2+
  Triplex: fail-fast              ~0.8 year      MTTF·(5/6)          3+
  Triplex: fail-soft              ~1.8 years     MTTF·(11/6)         3+
  Pair and spare: fail-fast       ~0.7 year      MTTF·(3/4)          4+
  Triplex with repair             >10^5 years    MTTF³/(3·MTTR²)     3+
  Duplex fail-soft with repair    >10^4 years    MTTF²/(2·MTTR)      2+

When To Repair?
• Chances of tolerating a fault are 1000:1 (class 3).
• A 1995 study, with processor & disc rated at ~10k-hour MTTF:

  Observed failures          Double failures   Ratio
  10k processor failures     14 double         ~1000:1
  40k disc failures          26 double         ~1000:1

• Hardware maintenance: on-line maintenance "works" 999 times out of 1000.
  – The chance a duplexed disc will fail during maintenance? ~1:1000.
  – Risk is 30x higher during maintenance, so do it at off-peak hours.
• Software maintenance: repair only virulent bugs; wait for the next release to fix benign bugs.

OK: So Far
• Hardware fail-fast is easy. Redundancy plus repair is great (class 7 availability). Hardware redundancy & repair is done via modules.
• How can we get instant software repair?
• We know how to get reliable storage: RAID, or dumps and transaction logs.
• We know how to get available storage: fail-soft duplexed discs (RAID 1...N).
• ? How do we get reliable execution?
• ? How do we get available execution?

Outline
• Terminology and empirical measures
• General methods to mask faults
• Software fault tolerance
• Summary

Key Idea
• Architecture: hardware faults, environmental faults, distribution, and maintenance are all masked by software.
• Software automates / eliminates operators.
• So, in the limit there are only software & design faults.
• Software fault tolerance is the key to dependability. INVENT IT!

Software Techniques: Learning from Hardware
• Recall that most outages are not hardware. Most outages in fault-tolerant systems are SOFTWARE.
• Fault avoidance techniques: good & correct design. After that:
• Software fault tolerance techniques:
  – modularity (isolation, fault containment)
  – design diversity
  – N-version programming: N different implementations
  – defensive programming: check parameters and data
  – auditors: check data structures in the background
  – transactions: clean up state after a failure
• Paradox: we need fail-fast software.

Fail-Fast and High-Availability Execution
• Software N-plexing: design diversity via N-version programming.
  – Write the same program N times (N ≥ 3).
  – Compare the outputs of all programs and take the majority vote.
• Process pairs: instant restart (repair).
  – Use defensive programming to make a process fail-fast.
  – Have the restarted process ready in a separate environment.
  – The second process "takes over" if the primary faults.
  – The transaction mechanism can clean up distributed state if takeover happens in the middle of a computation.
  – LOGICAL PROCESS = PROCESS PAIR (diagram: session; primary process; state information; backup process).

What Is the MTTF of an N-Version Program?
• The first version fails after MTTF/N, the second after MTTF/(N-1), and so on; the time until only one version still works is
  MTTF · (1/N + 1/(N-1) + ... + 1/2).
• The harmonic series goes to infinity, but VERY slowly: for example, 100-version programming gives only ~4x the MTTF of 1-version programming.
• It does reduce variance.
• N-version programming needs REPAIR: if a program fails, its state must be reset from the other programs, so the programs need a common data/state representation.
• How does this work for database systems? Operating systems? Network systems?
• Answer: I don't know.
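Not part of the original deck: a one-function check of the harmonic-series claim on the N-version slide above.

```python
# Time until only 1 of N versions still works = MTTF * (1/N + ... + 1/2).
def n_version_mttf_factor(n: int) -> float:
    return sum(1.0 / k for k in range(2, n + 1))

for n in (2, 3, 10, 100):
    print(n, round(n_version_mttf_factor(n), 2))
# 100 versions -> factor ~4.19: only ~4x the MTTF of a 1-version program.
```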
Why Process Pairs Mask Faults: Many Software Faults are Soft
(Recall the evidence above: after design review, code inspection, alpha test, beta test, and 10k hours of gamma test, most remaining faults are transient; MVS functional recovery routines 5:1, Tandem spooler 100:1, Adams >100:1. A Heisenbug works on retry; a Bohrbug faults again on retry. References: Adams, IBM J. R&D 28(1), 1984; Gray, Tandem TR 85.7, 1985; Mourad, 15th FTCS, 1985.)

Heisenbugs: A Probabilistic Approach to Availability
There is considerable evidence that (1) production systems have about one bug per thousand lines of code, (2) these bugs manifest themselves stochastically (failures are due to a confluence of rare events), and (3) system mean time to failure has a lower bound of a decade or so. To build highly available systems, architects must tolerate these failures by providing instant repair: unavailability is approximated by repair_time / time_to_fail, so cutting the repair time in half makes things twice as good. Ultimately, one builds a set of standby servers that have both design diversity and geographic diversity; this minimizes common-mode failures.

Process Pair Repair Strategy
• If the software fault (bug) is a Bohrbug, there is no repair: "wait for the next release", "get an emergency bug fix", or "get a new vendor".
• If the software fault is a Heisenbug, the repair is: reboot and retry, or switch to the backup process (instant restart).
• PROCESS PAIRS tolerate hardware faults and Heisenbugs. Repair time is seconds, and could be milliseconds if time is critical.
• Flavors of process pair: lockstep; automatic; state checkpointing; delta checkpointing; persistent.

How Takeover Masks Failures
• The server resets at takeover. But what about application state? Database state? Network state?
• Answer: use transactions to reset state! Abort the transaction if a process fails.
• This keeps the network "up" and the system "up"; some transactions are simply reprocessed on failure.

Process Pairs - Summary
• Transactions give reliability; process pairs give availability.
• Process pairs are expensive & hard to program.
• Transactions + persistent process pairs give fault-tolerant sessions & execution.
• When Tandem converted to this style, it saved 3x on messages and 5x on message bytes, and made programming easier.

System Pairs for High Availability
(Diagram: primary and backup sites.)
• Programs, data, and processes are replicated at two sites; the pair looks like a single system, so the "system" becomes a logical concept. Like process pairs: SYSTEM pairs.
• The backup receives the transaction log (spooled if the backup is down).
• If the primary fails, or the operator switches, the backup offers service (a minimal sketch of this log-shipping idea follows).
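Not part of the original deck: a minimal Python sketch of the system-pair idea above, where the primary ships its transaction log to the backup, spools while the backup is down, and the backup takes over on demand. All class and variable names here are invented for illustration.

```python
# A toy system pair: log shipping with spooling and takeover.
class Site:
    def __init__(self):
        self.db = {}        # key -> value
        self.up = True

    def apply(self, log_record):
        key, value = log_record
        self.db[key] = value

class SystemPair:
    def __init__(self):
        self.primary, self.backup = Site(), Site()
        self.spool = []     # log records not yet applied at the backup

    def run_transaction(self, key, value):
        record = (key, value)
        self.primary.apply(record)        # commit at the primary
        if self.backup.up:
            for r in self.spool:          # drain any spooled log first
                self.backup.apply(r)
            self.spool.clear()
            self.backup.apply(record)     # ship the log record
        else:
            self.spool.append(record)     # backup down: spool the log

    def takeover(self):
        # Primary fails (or the operator switches): backup offers service.
        self.primary, self.backup = self.backup, self.primary

pair = SystemPair()
pair.run_transaction("acct:1", 100)
pair.backup.up = False                    # backup goes down; log is spooled
pair.run_transaction("acct:1", 150)
pair.backup.up = True
pair.run_transaction("acct:2", 50)        # spool drains on the next ship
pair.takeover()
print(pair.primary.db)                    # {'acct:1': 150, 'acct:2': 50}
```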
System Pair Configuration Options
• Mutual backup: each site is primary for half of the database & application and backup for the other half.
• Hub: one site acts as backup for many others.
• In general, the configuration can be any directed graph.
• Stale replicas: lazy replication; the primary feeds read-only copies.

System Pairs for Software Maintenance
• Step 1: both systems are running V1.
• Step 2: the backup is cold-loaded as V2.
• Step 3: SWITCH to the backup.
• Step 4: the other system is cold-loaded as V2.
• Similar ideas apply to: database reorganization; hardware modification (e.g., adding discs or processors); hardware maintenance; environmental changes (rewiring, new air conditioning); moving the primary or backup to a new location.

System Pair Benefits
• Protects against ENVIRONMENT: weather, utilities, sabotage.
• Protects against OPERATOR FAILURE: two sites, two sets of operators.
• Protects against MAINTENANCE OUTAGES: work on the backup (software/hardware install/upgrade/move...).
• Protects against HARDWARE FAILURES: the backup takes over.
• Protects against TRANSIENT SOFTWARE ERRORS.
• Allows design diversity (different sites can have different software/hardware).

Key Idea (recap)
• Architecture: hardware faults, environmental faults, distribution, and maintenance are all masked by software, and software automates / eliminates operators.
• So, in the limit there are only software & design faults, and many of those are Heisenbugs.
• Software fault tolerance is the key to dependability. INVENT IT!

References
Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development 28(1): 2-14.
Anderson, T. and B. Randell (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois (1990). "Issues in Disaster Recovery." 35th IEEE Compcon 90: 573-577.
Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems: 3-12.
Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability 39(4): 409-418.
Gray, J. and A. Reuter (1993). Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann.
Lampson, B. W. (1981). "Atomic Transactions." In Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS: 2-11.
Long, D. D., J. L. Carroll, and C. J. Park (1991). "A Study of the Reliability of Internet Sites." Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991: 177-186.
Long, D. D., A. Muir, and R. Golding (1995). "A Longitudinal Study of Internet Host Reliability." Proc. Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995: 2-9.
Scaleable Replicated Databases
Jim Gray (Microsoft), Pat Helland (Microsoft), Dennis Shasha (NYU), Pat O'Neil (U. Mass)

Outline
• Replication strategies: lazy and eager; master and group.
• How centralized databases scale: deadlocks rise non-linearly with transaction size and concurrency.
• Replication systems are unstable on scaleup.
• A possible solution.

Scaleup, Replication, Partition
• Base case: a 1 TPS system (100 users, one 1 TPS server).
• Scaleup: a 2 TPS centralized system (200 users, one 2 TPS server).
• Partitioning: two 1 TPS systems, each serving 100 users, with no cross-site traffic.
• Replication: two 2 TPS systems; each node serves its own 100 users at 1 TPS and also applies the other node's 1 TPS of updates. That is N² more work.

Why Replicate Databases?
• Give users a local copy for performance, availability, and mobility (they are disconnected).
• But... what if they update it? Then updates must propagate to the other copies.

Propagation Strategies
• Eager: send updates right away, as part of the same transaction; transactions become N times larger.
• Lazy: send updates asynchronously, as separate transactions; there are N times more transactions.
• Either way: N times more updates per second per node, and N² times more work overall.

Update Control Strategies
• Master: each object has a master node; all updates start with the master and are broadcast to the subscribers.
• Group: an object can be updated by anyone; each update is broadcast to all others.
• Everyone wants Lazy Group: update anywhere, anytime, anyway.

Quiz Questions: Name One
• Eager Master: N-plexed disks. Eager Group: ?
• Lazy Master: Bibles, bank accounts, SQL Server. Lazy Group: name servers, Oracle, Access...
• Note: Lazy contradicts Serializable. If two lazy updates collide, then reconcile:
  – discard one transaction (or use some other rule), or
  – ask for human advice.
• Meanwhile, the nodes disagree, so the network DB state diverges: System Delusion.

Anecdotal Evidence
• Update-anywhere systems are attractive; products offer the feature; it demos well.
• But when it scales up, reconciliations start to cascade and the database drifts "out of sync" (System Delusion).
• What's going on?

Outline (recap)
• Replication strategies (lazy and eager, master and group). • How centralized databases scale: deadlocks rise non-linearly. • Replication is unstable on scaleup. • A possible solution.

Simple Model of Waits
• DB_size records; TPS transactions per second.
• Each transaction picks Actions records uniformly from the set of DB_size records, then commits.
• About Transactions x Actions / 2 resources are locked at any instant.
• Chance that a single request waits: (Transactions x Actions) / (2 x DB_size).
• Action rate: TPS x Actions. Active transactions: TPS x Actions x Action_Time.
• Wait rate = Action rate x Chance a request waits
            = TPS² x Actions³ x Action_Time / (2 x DB_size)
• 10x more transactions means 100x more waits.

Simple Model of Deadlocks
• A deadlock is a wait cycle. For a cycle of length 2:
  Deadlock rate ≈ Wait rate x (P(transaction waits) / Transactions)
                ≈ TPS² x Actions⁵ x Action_Time / (4 x DB_size²)
• Cycles of length 3 occur at rate ~PW³, so they are ignored.
• 10x bigger transactions means 100,000x more deadlocks (both models are coded below).
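Not part of the original deck: the wait and deadlock formulas above, coded directly from the slides so the scaling claims can be checked numerically. The workload numbers below are made up for illustration.

```python
# The slide's simple models of waits and deadlocks.
def wait_rate(tps, actions, action_time, db_size):
    # waits/second = action_rate * P(a single request waits)
    return tps**2 * actions**3 * action_time / (2 * db_size)

def deadlock_rate(tps, actions, action_time, db_size):
    # wait cycles of length 2; longer cycles (~PW^3) are ignored
    return tps**2 * actions**5 * action_time / (4 * db_size**2)

base = dict(tps=100, actions=10, action_time=0.01, db_size=10**9)
print(wait_rate(**base), deadlock_rate(**base))

# 10x more transactions (TPS) -> 100x more waits:
more = dict(base, tps=1000)
print(wait_rate(**more) / wait_rate(**base))          # 100.0

# 10x bigger transactions (Actions) -> 100,000x more deadlocks:
fat = dict(base, actions=100)
print(deadlock_rate(**fat) / deadlock_rate(**base))   # 100000.0
```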
Summary So Far
• Even centralized systems are unstable:
  – Waits grow with the square of concurrency and the 3rd power of transaction size.
  – The deadlock rate grows with the square of concurrency and the 5th power of transaction size.

Outline (recap)
• Replication strategies. • How centralized databases scale. • Replication is unstable on scaleup: eager (master & group), lazy (master & group & disconnected). • A possible solution.

Eager Transactions are FAT
• If there are N nodes, an eager transaction is N times bigger:
  – it takes N times longer;
  – 10x nodes means 1,000x more deadlocks (derivation in the paper).
• Master is slightly better than Group.
• Good news: eager transactions only deadlock; there is no need for reconciliation.

Lazy Master & Group
• Use optimistic concurrency control:
  – keep a transaction timestamp with each record;
  – updates carry the old and new timestamps;
  – if the record has the old timestamp: set the value to the new value and the timestamp to the new timestamp;
  – if the record does not match the old timestamp: reject the lazy transaction;
  – note: this is not SNAPSHOT isolation (stale reads).
• A lazy transaction is: TRID, timestamp, then (OID, old time, new value) per update.
• Reconciliation: at any moment some nodes are updated and some nodes are "being reconciled".

Reconciliation
• Reconciliation means System Delusion: data inconsistent with itself and with reality.
• How frequent is it? Lazy transactions are not fat, but there are N times as many, and eager waits become lazy reconciliations:
  Rate ≈ TPS² x (Actions x Nodes)³ x Action_Time / (2 x DB_size)
  (assuming everyone is connected).

Eager & Lazy: Disconnected
• Suppose mobile nodes are disconnected for a day. When they reconnect, they get all incoming updates and send all delayed updates.
• Incoming: Nodes x TPS x Actions x Disconnect_Time updates. Outgoing: TPS x Actions x Disconnect_Time updates.
• Conflicts are the intersection of these two sets:
  ≈ Disconnect_Time x (TPS x Actions x Nodes)² / DB_size

Outline (recap)
• Replication strategies (lazy & eager, master & group). • How centralized databases scale. • Replication is unstable on scaleup. • A possible solution:
  – a two-tier architecture: mobile & base nodes, with base nodes mastering objects;
  – tentative transactions at mobile nodes, which must be commutative;
  – transactions are re-applied on reconnect, and may be rejected.

Safe Approach
• Each object is mastered at a node; update transactions only read and write master items.
• Lazy replication to the other nodes; allow reads of stale data (on user request).
• PROBLEMS: doesn't support mobile users; deadlocks explode with scaleup.
• ?? How do banks work ???

Two-Tier Replication
• Two kinds of nodes:
  – base nodes: always connected, always up;
  – mobile nodes: occasionally connected.
• Data is mastered at base nodes.
• Mobile nodes have stale copies and make tentative updates.

Mobile Node Makes Tentative Updates
• It updates the local database while disconnected, and saves its transactions.
• When the mobile node reconnects, tentative transactions are re-done as eager-master transactions (at the original time??).
• Some may be rejected; this replaces reconciliation. No System Delusion.

Tentative Transactions
• Must be commutative with others: "debit $50" rather than "change $150 to $100".
• Must have acceptance criteria, for example:
  – the account balance is positive;
  – the ship date is no later than quoted;
  – the price is no greater than quoted.
(Diagram: tentative transactions at the local DB; transactions from others; updates & rejects flow back. A small sketch of such a transaction follows.)
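Not part of the original deck: a minimal Python sketch of a tentative, commutative transaction with an acceptance criterion, as described above. All names here are invented for illustration.

```python
# Tentative transactions: commutative operations plus acceptance criteria.
from dataclasses import dataclass

@dataclass
class Debit:                  # "debit $50", not "change $150 to $100"
    account: str
    amount: int               # commutes with other debits/credits

class BaseNode:
    def __init__(self):
        self.balances = {"acct:1": 150}

    def apply(self, txn: Debit) -> bool:
        new_balance = self.balances[txn.account] - txn.amount
        if new_balance < 0:
            return False      # acceptance criterion: balance stays positive
        self.balances[txn.account] = new_balance
        return True

class MobileNode:
    def __init__(self):
        self.tentative = []   # transactions saved while disconnected

    def debit(self, account, amount):
        self.tentative.append(Debit(account, amount))

    def reconnect(self, base: BaseNode):
        # Re-run tentative transactions as eager-master; some may be rejected.
        rejected = [t for t in self.tentative if not base.apply(t)]
        self.tentative.clear()
        return rejected       # the user is told right away: no delusion

base, mobile = BaseNode(), MobileNode()
mobile.debit("acct:1", 100)
mobile.debit("acct:1", 100)   # the second debit would overdraw
print(mobile.reconnect(base)) # [Debit(account='acct:1', amount=100)]
print(base.balances)          # {'acct:1': 50}
```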
Refinement: Mobile Node Can Master Some Data
• A mobile node can master "private" data: only the mobile node updates this data; others only read it.
• Examples: orders generated by a salesman; mail generated by a user; documents generated by a Notes user.

Virtues of the 2-Tier Approach
• Allows mobile operation.
• No system delusion.
• Rejects are detected at reconnect (you know right away).
• If commutativity works, there are no reconciliations, even though work still rises as (Mobile + Base)².

Outline (summary)
• Replication strategies (lazy & eager, master & group).
• How centralized databases scale.
• Replication is unstable on scaleup.
• A possible solution: a two-tier architecture.
  – Tentative transactions at mobile nodes; re-apply transactions on reconnect.
  – Transactions may be rejected & reconciled.
  – Avoids system delusion.