Dependable Computing Systems

Jim Gray
Microsoft, Gray@Microsoft.com
Andreas Reuter
International University, Andreas.Reuter@i-u.de
        Mon        Tue           Wed          Thur             Fri
9:00    Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
1:30    Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
3:30    T Models   Queues        Adv TM       Replication      Benchmark
7:00    Party, Workflow, Cyberbrick, Party
Gray & Reuter FT
The Airplane Rule
A two-engine airplane has twice as many engine problems.
A thousand-engine airplane has thousands of engine problems.
Internet: Node fails every 2 weeks
Vendors: Disk fails every 40 years
Here: node “fails” every 20 minutes
disk fails every 2 weeks.
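The arithmetic behind the airplane rule can be sketched as follows. This is a minimal sketch, assuming independent modules with identical, constant failure rates, so that N modules see their first failure N times sooner than one module does. The function name `fleet_mttf` is hypothetical; the 40-year disc rating and the 1,000-disc count are the figures on this slide.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours

def fleet_mttf(module_mttf_hours: float, n_modules: int) -> float:
    """Mean time to the first failure among n identical, independent modules."""
    if n_modules < 1:
        raise ValueError("need at least one module")
    return module_mttf_hours / n_modules

# Vendors rate a disc at 40 years MTTF; a site with 1,000 discs
# should then expect some disc to fail roughly every two weeks.
disc_mttf_hours = 40 * HOURS_PER_YEAR
days_between_disc_failures = fleet_mttf(disc_mttf_hours, 1000) / 24
print(f"{days_between_disc_failures:.1f} days between disc failures")
```

Under these assumptions the result is about 14.6 days, which matches the "disk fails every 2 weeks" line.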
Fault Tolerance is KEY!
Mask and repair faults
[Figure: High Speed Network (10 Gb/s) connecting 100 Nodes (1 Tips); 1,000 discs = 10 Terrorbytes; 100 Tape Transports = 1,000 tapes = 1 PetaByte]
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
DEPENDABILITY: The 3 ITIES
• Reliability / Integrity: Does the right thing (also large MTTF)
• Availability: Does it now. (also large MTTF / (MTTF + MTTR))
[Diagram: regions labeled Security, Integrity, Reliability, Availability]
System Availability:
If 90% of terminals up & 99% of DB up?
(=>89% of transactions are serviced on time).
• Holistic vs Reductionist view
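The terminal and database example can be checked with a short calculation. This is a minimal sketch, assuming the parts fail independently and a transaction needs all of them at once; the function names are hypothetical.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def series_availability(*parts: float) -> float:
    """Availability of a service that needs every part up at once,
    assuming the parts fail independently."""
    result = 1.0
    for a in parts:
        result *= a
    return result

# The slide's example: 90% of terminals up and 99% of the database up.
print(f"{series_availability(0.90, 0.99):.1%}")  # 89.1%
# A module with a 2000-hour MTTF and a 1-hour MTTR.
print(f"{availability(2000, 1):.4%}")
```

This is the holistic view in miniature: the user sees the product of the parts, not the best part.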
High Availability System Classes
Goal: Build Class 6 Systems
System Type              Unavailable (min/year)   Availability   Availability Class
Unmanaged                50,000                   90%            1
Managed                  5,000                    99%            2
Well Managed             500                      99.9%          3
Fault Tolerant           50                       99.99%         4
High-Availability        5                        99.999%        5
Very-High-Availability   .5                       99.9999%       6
Ultra-Availability       .05                      99.99999%      7
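The class numbers count the leading nines of availability. A minimal sketch of that relationship follows, assuming class = floor(log10(1 / (1 - A))) and a 365-day year; both function names are hypothetical. The slide rounds 525,600 minutes per year to about 500,000, so its minutes column is slightly lower than the exact values printed here.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def unavailable_minutes_per_year(availability: float) -> float:
    return (1.0 - availability) * MINUTES_PER_YEAR

def availability_class(availability: float) -> int:
    """Number of leading nines: floor(log10(1 / (1 - A)))."""
    nines = -math.log10(1.0 - availability)
    # Round first so that floating-point noise (e.g. 2.9999999999) does not
    # push an exact "three nines" value down into class 2.
    return math.floor(round(nines, 6))

for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(availability_class(a), f"{unavailable_minutes_per_year(a):,.2f} min/year")
```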
Sources of Failures
                     MTTF         MTTR
Power Failure:       2000 hr      1 hr
Phone Lines
  Soft               >.1 hr       .1 hr
  Hard               4000 hr      10 hr
Hardware Modules:    100,000 hr   10 hr   (many are transient)
Software:
1 Bug/1000 Lines Of Code (after vendor-user testing)
=> Thousands of bugs in System!
Most software failures are transient: dump & restart system.
Useful fact: 8,760 hrs/year ~ 10k hr/year
Case Studies - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
Cause                            Share of outages   MTTF
Vendor (hardware and software)   42%                5 Months
Application software             25%                9 Months
Communications lines             12%                1.5 Years
Operations                       9.3%               2 Years
Environment                      11.2%              2 Years
All causes                                          10 Weeks
1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
To get a 10-year MTTF, we must attack all these problems.
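The per-cause MTTF figures follow from the outage shares and the overall 10-week MTTF. This is a minimal sketch, assuming each cause's outages are spread evenly over the survey period; the function name `mttf_by_cause` is hypothetical, and the shares are the survey percentages on this slide.

```python
WEEKS_PER_MONTH = 52 / 12

overall_mttf_weeks = 10
outage_share = {
    "Vendor (hardware and software)": 0.42,
    "Application software": 0.25,
    "Communications lines": 0.12,
    "Environment": 0.112,
    "Operations": 0.093,
}

def mttf_by_cause(overall_mttf: float, shares: dict[str, float]) -> dict[str, float]:
    """If a cause produces a fraction s of all outages, outages from that
    cause alone arrive 1/s times less often than outages overall."""
    return {cause: overall_mttf / s for cause, s in shares.items()}

per_cause_weeks = mttf_by_cause(overall_mttf_weeks, outage_share)
for cause, weeks in per_cause_weeks.items():
    print(f"{cause}: {weeks / WEEKS_PER_MONTH:.1f} months")
```

The same relation explains the closing remark: removing any single cause leaves the others in place, so a 10-year MTTF needs every share reduced.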
Case Studies -Tandem
Outage Reports to Vendor
Summary Tandem EWR Data
                    1985       1987       1989
Customers           1000       1300       2000
EWR Customers       ?          ?          267
Outage Customers    176        205        164
Systems             2400       6000       9000
Processors          7,000      15,000     25,500
Discs               16,000     46,000     74,000
Cases               305        227        501
Reports             491        535        766
Faults              592        609        892
Outages             285        294        438
System MTTF         8 years    20 years   21 years
Totals:
More than 7,000 Customer years
More than 30,000 System years
More than 80,000 Processor years
More than 200,000 Disc years
Systematic Under-reporting
But ratios & trends interesting
Case Studies - Tandem Trends
MTTF improved: WOW! Outages per millennium.
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of
Environment
Operations errors
Application Software
Case Studies - Tandem Trends
Reported MTTF by Component
[Figure: Mean Time to System Failure (years) by Cause, 1985 to 1989; series: maintenance, hardware, environment, operations, software, total]
               1985   1987   1990
SOFTWARE       2      53     33     Years
HARDWARE       29     91     310    Years
MAINTENANCE    45     162    409    Years
OPERATIONS     99     171    136    Years
ENVIRONMENT    142    214    346    Years
SYSTEM         8      20     21     Years
Remember Systematic Under-reporting
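The SYSTEM row can be related to the per-cause rows by adding failure rates. This is a minimal sketch, assuming the causes are independent and any one of them brings the system down; the function name `combined_mttf` is hypothetical, and the inputs are the 1987 and 1990 columns of this slide's table.

```python
def combined_mttf(mttfs_by_cause: dict[str, float]) -> float:
    """System MTTF when any single cause can bring the system down.
    Failure rates (1 / MTTF) of independent causes add."""
    total_rate = sum(1.0 / mttf for mttf in mttfs_by_cause.values())
    return 1.0 / total_rate

# Reported MTTF by cause, in years, from the 1987 and 1990 columns.
reported_1987 = {"software": 53, "hardware": 91, "maintenance": 162,
                 "operations": 171, "environment": 214}
reported_1990 = {"software": 33, "hardware": 310, "maintenance": 409,
                 "operations": 136, "environment": 346}

print(f"1987: {combined_mttf(reported_1987):.1f} years")
print(f"1990: {combined_mttf(reported_1990):.1f} years")
```

Both columns combine to a little over 21 years, close to the reported 20 and 21 years. The weakest cause dominates: in the 1990 column, software alone caps the system at 33 years.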
Summary
Current Situation: ~4-year MTTF
=> Fault Tolerance Works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
New System Software.
New Application Software.
Utilities.
Must make all software ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable Goal: 100-year MTTF.
class 4 today => class 6 tomorrow.
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
Key Idea
{ Architecture, Software, Distribution } Masks { Hardware Faults, Environmental Faults, Maintenance }
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!
Fault Tolerance Techniques
FAIL FAST MODULES: work or stop
SPARE MODULES: instant repair time.
INDEPENDENT MODULE FAILS by design
MTTF_pair ~ MTTF^2 / MTTR (so want tiny MTTR)
MESSAGE BASED OS: Fault Isolation
software has no shared memory.
SESSION-ORIENTED COMM: Reliable messages
detect lost/duplicate messages
coordinate messages with commit
PROCESS PAIRS: Mask Hardware & Software Faults
TRANSACTIONS: give A.C.I.D. (simple fault model)
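The pair formula shows why repair time matters so much. The following is a minimal sketch using the slide's rough form MTTF^2 / MTTR, which ignores constant factors. The function name `pair_mttf` is hypothetical, and the 100,000-hour module MTTF is the hardware-module figure from the Sources of Failures slide, used here only as an illustrative input.

```python
HOURS_PER_YEAR = 8760

def pair_mttf(module_mttf_hours: float, mttr_hours: float) -> float:
    """Rough MTTF of a fail-soft pair with repair: MTTF^2 / MTTR.
    The pair is lost only if the second module fails while the
    first one is still being repaired."""
    return module_mttf_hours ** 2 / mttr_hours

module_mttf = 100_000  # hours, illustrative hardware-module figure
for mttr in (10.0, 1.0, 0.1):
    years = pair_mttf(module_mttf, mttr) / HOURS_PER_YEAR
    print(f"MTTR {mttr:>4} h -> pair MTTF ~ {years:,.0f} years")
```

Each tenfold cut in MTTR buys a tenfold gain in pair MTTF, which is the point of "instant repair" through spare modules.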
Example: the FT Bank
Fault Tolerant Computer
Backup System
System MTTF >10 years (except for power & terminals)
Modularity & Repair are KEY:
von Neumann needed 20,000x redundancy in wires and switches
We use 2x redundancy.
Redundant hardware can support peak loads (so not redundant)
Fail-Fast is Good, Repair is Needed
[Diagram: lifecycle of a module: Fault, Detect, Repair]
fail-fast gives short fault latency
High Availability is low UN-Availability
Unavailability ~ MTTR / MTTF
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
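The symmetry between MTTR and MTTF can be seen numerically. This is a minimal sketch with hypothetical function names and illustrative inputs (a 10,000-hour MTTF and a 10-hour MTTR); it compares the approximation MTTR / MTTF with the exact MTTR / (MTTF + MTTR).

```python
def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Exact fraction of time down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)

def unavailability_approx(mttf_hours: float, mttr_hours: float) -> float:
    """The approximation MTTR / MTTF, good when MTTR is much smaller than MTTF."""
    return mttr_hours / mttf_hours

base = unavailability_approx(10_000, 10)          # 0.001
halved_repair = unavailability_approx(10_000, 5)  # 0.0005
doubled_mttf = unavailability_approx(20_000, 10)  # 0.0005
print(base, halved_repair, doubled_mttf)
print(unavailability(10_000, 10))
```

Halving repair time and doubling MTTF give the same unavailability, so the cheaper of the two is the one to pursue.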
Hardware Reliability/Availability
(how to make it fail fast)
Basic FailFast Designs: Pair, Triplex
Recursive Designs
Recursive Availability Designs: Triple Modular Redundancy, Pair & Spare
Comparator Strategies:
Duplex:
Fail-Fast: fail if either fails (e.g. duplexed cpus)
vs
Fail-Soft: fail if both fail (e.g. disc, atm,...)
Note: in recursive pairs, parent knows which is bad.
Triplex:
Fail-Fast: fail if 2 fail (triplexed cpus)
Fail-Soft: fail if 3 fail (triplexed FailFast cpus)
Redundant Designs Have Worse MTTF!
Design                    MTTF         Sum of stage times (no repair)
Duplex: fail fast         mttf/2       mttf/2
Duplex: fail soft         1.5*mttf     mttf/2 + mttf/1
TMR: fail fast            5/6*mttf     mttf/3 + mttf/2
TMR: fail soft            11/6*mttf    mttf/3 + mttf/2 + mttf/1
Pair & Spare: fail fast   3/4*mttf     mttf/4 + mttf/2
Pair & Spare: fail soft   ~2.1*mttf    mttf/4 + mttf/3 + mttf/2 + mttf/1
[Diagrams: each design is drawn as a chain of states numbered by working modules, with the mean time spent in each state on the arrows]
THIS IS NOT GOOD: Variance is lower but MTTF is worse
Simple redundancy does not improve MTTF (sometimes hurts).
This is just an example of the airplane rule.
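The MTTF figures for the redundant designs can be reproduced by adding up the mean time spent at each stage. This is a minimal sketch, assuming independent modules with constant failure rates and no repair; the function name `mttf_without_repair` and the stage lists are an illustrative encoding of the slide's diagrams, not part of the original material.

```python
from fractions import Fraction

def mttf_without_repair(stages: list[int]) -> Fraction:
    """Expected lifetime, in units of one module's MTTF, of a design that
    passes through the given stages. Each stage is the number of modules
    still running; with k independent modules the next failure arrives
    after MTTF / k on average."""
    return sum((Fraction(1, k) for k in stages), Fraction(0))

designs = {
    "Duplex: fail fast": [2],
    "Duplex: fail soft": [2, 1],
    "TMR: fail fast": [3, 2],
    "TMR: fail soft": [3, 2, 1],
    "Pair & Spare: fail fast": [4, 2],   # first failure kills a whole pair
    "Pair & Spare: fail soft": [4, 3, 2, 1],
}
for name, stages in designs.items():
    value = mttf_without_repair(stages)
    print(f"{name}: {value} = {float(value):.2f} * mttf")
```

Exact fractions are used so the results can be compared directly with the slide: 1/2, 3/2, 5/6, 11/6, 3/4, and 25/12 (about 2.1).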
Add Repair: Get 10^4 Improvement
[Diagrams: the duplex and TMR state chains with mttr repair arcs added; recoverable labels: duplex fail fast mtbf/2, and 10^4 mttf and 10^5 mttf for designs with repair]
Availability estimates
1 year MTTF modules
12-hour MTTR
                            MTTF            EQUATION           COST
SIMPLEX                     1 year          MTTF               1
DUPLEX: FAIL FAST           ~0.5 years      MTTF/2             2+
DUPLEX: FAIL SOFT           ~1.5 years      MTTF(3/2)          2+
TRIPLEX: FAIL FAST          .8 year         MTTF(5/6)          3+
TRIPLEX: FAIL SOFT          1.8 year        1.8 MTTF           3+
Pair and spare: FAIL-FAST   ~.7 year        MTTF(3/4)          4+
TRIPLEX WITH REPAIR         >10^5 years     MTTF^3/3MTTR^2     3+
Duplex fail soft + REPAIR   >10^4 years     MTTF^2/2MTTR       4+
When To Repair?
Chances Of Tolerating A Fault are 1000:1 (class 3)
A 1995 study: Processor & Disc Rated At ~ 10khr MTTF
Computed Single Failures   Observed Double Fails   Ratio
10k Processor Fails        14 Double               ~ 1000 : 1
40k Disc Fails             26 Double               ~ 1000 : 1
Hardware Maintenance:
On-Line Maintenance "Works" 999 Times Out Of 1000.
The chance a duplexed disc will fail during maintenance ~ 1:1000
Risk Is 30x Higher During Maintenance
=> Do It Off Peak Hour
Software Maintenance:
Repair Only Virulent Bugs
Wait For Next Release To Fix Benign Bugs
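The single-to-double failure ratios in the study quoted on this slide can be recomputed, and compared with a simple estimate of the chance that a partner fails during repair. This is a minimal sketch; both function names are hypothetical, and the 10-hour repair window is an assumed value chosen for illustration, not a figure from the study.

```python
def tolerance_ratio(single_failures: int, double_failures: int) -> float:
    """How many single failures occurred for each double failure."""
    return single_failures / double_failures

def double_failure_odds(mttf_hours: float, repair_window_hours: float) -> float:
    """Chance that the partner module fails while the first is being
    repaired, assuming independent failures at a constant rate."""
    return repair_window_hours / mttf_hours

print(round(tolerance_ratio(10_000, 14)))   # processors: 714
print(round(tolerance_ratio(40_000, 26)))   # discs: 1538
# Hypothetical 10-hour repair window on a module rated at 10k hours MTTF.
print(double_failure_odds(10_000, 10))
```

The observed ratios, roughly 700:1 and 1500:1, bracket the slide's round figure of 1000:1.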
OK: So Far
Hardware fail-fast is easy
Redundancy plus Repair is great (Class 7 availability)
Hardware redundancy & repair is via modules.
How can we get instant software repair?
We Know How To Get Reliable Storage
RAID Or Dumps And Transaction Logs.
We Know How To Get Available Storage
Fail Soft Duplexed Discs (RAID 1...N).
? HOW DO WE GET RELIABLE EXECUTION?
? HOW DO WE GET AVAILABLE EXECUTION?
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
Software Techniques: Learning from Hardware
Most outages in Fault Tolerant Systems are SOFTWARE
Fault Avoidance Techniques: Good & Correct design.
After that: Software Fault Tolerance Techniques:
Modularity (isolation, fault containment)
Design diversity
N-Version Programming: N-different implementations
Defensive Programming: Check parameters and data
Auditors: Check data structures in background
Transactions: to clean up state after a failure
Paradox: Need Fail-Fast Software
Fail-Fast and High-Availability Execution
Software N-Plexing: Design Diversity
N-Version Programming
Write the same program N-Times (N > 3)
Compare outputs of all programs and take majority vote
Process Pairs: Instant restart (repair)
Use Defensive programming to make a process fail-fast
Have restarted process ready in separate environment
Second process “takes over” if primary faults
Transaction mechanism can clean up distributed state
if takeover in middle of computation.
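The N-version idea of comparing outputs and taking a majority vote can be sketched as below. This is a minimal sketch, not a production voter: the three "versions" are hypothetical toy functions, one with a planted bug, and a version that raises is simply treated as casting no vote.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority answer.
    A version that raises is treated as having produced no vote (fail-fast)."""
    votes = []
    for version in versions:
        try:
            votes.append(version(x))
        except Exception:
            continue
    if not votes:
        raise RuntimeError("all versions failed")
    answer, count = Counter(votes).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among the versions")
    return answer

# Three hypothetical, independently written versions of "square a number".
def version_a(x: int) -> int: return x * x
def version_b(x: int) -> int: return x ** 2
def version_c(x: int) -> int: return x * x + (1 if x == 7 else 0)  # planted bug

print(majority_vote([version_a, version_b, version_c], 7))  # 49
```

The voter masks the planted bug only because the other two versions agree; a fault shared by a majority of versions would pass unnoticed, which is why design diversity matters.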
[Diagram: LOGICAL PROCESS = PROCESS PAIR; labels: SESSION, PRIMARY PROCESS, STATE INFORMATION, BACKUP PROCESS]
What Is MTTF of N-Version Program?
First fails after MTTF/N
Second fails after MTTF/(N-1),...
so MTTF(1/N + 1/(N-1) + ... + 1/2)
harmonic series goes to infinity, but VERY slowly
for example 100-version programming gives
~4 MTTF of 1-version programming
Reduces variance
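The harmonic-series estimate is easy to evaluate. This is a minimal sketch of the slide's expression 1/N + 1/(N-1) + ... + 1/2, in units of one version's MTTF and without repair; the function name is hypothetical.

```python
def n_version_mttf_factor(n: int) -> float:
    """The slide's estimate: 1/N + 1/(N-1) + ... + 1/2, in units of the
    MTTF of a single version (no repair)."""
    return sum(1.0 / k for k in range(2, n + 1))

for n in (3, 10, 100, 1000):
    print(n, round(n_version_mttf_factor(n), 2))
```

For N = 100 the factor is about 4.19, in line with the "~4 MTTF" figure, and going to 1,000 versions adds only a little more than 2.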
N-Version Programming Needs REPAIR
If a program fails, must reset its state from other programs.
=> programs have common data/state representation.
How does this work for
Database Systems?
Operating Systems?
Network Systems?
Answer: I don’t know.
Why Process Pairs Mask Faults
Many Software Faults are Soft
After
Design Review
Code Inspection
Alpha Test
Beta Test
10k Hrs Of Gamma Test (Production)
Most Software Faults Are Transient
MVS Functional Recovery Routines   5:1
Tandem Spooler                     100:1
Adams                              >100:1
Terminology:
Heisenbug: Works On Retry
Bohrbug: Faults Again On Retry
Adams: "Optimizing Preventative Service of Software Products", IBM J R&D,28.1,1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.
Process Pair Repair Strategy
If software fault (bug) is a Bohrbug, then there is no repair
“wait for the next release” or
“get an emergency bug fix” or
“get a new vendor”
If software fault is a Heisenbug, then repair is
reboot and retry or
switch to backup process (instant restart)
PROCESS PAIRS Tolerate
Hardware Faults
Heisenbugs
Repair time is seconds, could be milliseconds if time is critical
Flavors Of Process Pair:
Lockstep
Automatic
State Checkpointing
Delta Checkpointing
Persistent
[Diagram: LOGICAL PROCESS = PROCESS PAIR; labels: SESSION, PRIMARY PROCESS, STATE INFORMATION, BACKUP PROCESS]
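A toy model of the state-checkpointing flavor shows how takeover masks a Heisenbug. This is a minimal sketch in a single Python process, so it only imitates the control flow of a real pair. The names `Process`, `run_pair`, and `TransientFault` are hypothetical, and faults are injected by request index rather than occurring on their own.

```python
class TransientFault(Exception):
    """A Heisenbug: the same request succeeds when retried."""

class Process:
    def __init__(self, name: str):
        self.name = name
        self.total = 0          # the state being protected
        self.next_request = 0   # index of the next request to process

    def restore(self, checkpoint: tuple[int, int]) -> None:
        self.total, self.next_request = checkpoint

def run_pair(requests: list[int], fault_at: set[int]) -> tuple[int, int]:
    """Process all requests with a primary/backup pair.
    Returns (final total, number of takeovers)."""
    primary, backup = Process("primary"), Process("backup")
    checkpoint = (0, 0)         # last state shipped to the backup
    pending_faults = set(fault_at)
    takeovers = 0
    while primary.next_request < len(requests):
        i = primary.next_request
        try:
            if i in pending_faults:
                pending_faults.discard(i)   # transient: will not recur
                raise TransientFault(f"{primary.name} failed on request {i}")
            primary.total += requests[i]
            primary.next_request = i + 1
            checkpoint = (primary.total, primary.next_request)
        except TransientFault:
            # Fail fast: the primary stops, the backup resumes from the
            # last checkpoint and becomes the new primary.
            backup.restore(checkpoint)
            primary, backup = backup, Process("restarted")
            takeovers += 1
    return primary.total, takeovers

print(run_pair([5, 10, 20], fault_at={1}))  # (35, 1)
```

A Bohrbug would correspond to a fault that is never removed from `pending_faults`: the backup would fail on the same request, and no amount of takeover would help.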
How Takeover Masks Failures
Server Resets At Takeover But What About
Application State?
Database State?
Network State?
[Diagram: LOGICAL PROCESS = PROCESS PAIR; labels: SESSION, PRIMARY PROCESS, STATE INFORMATION, BACKUP PROCESS]
Answer: Use Transactions To Reset State!
Abort Transaction If Process Fails.
Keeps Network "Up"
Keeps System "Up"
Reprocesses Some Transactions On Failure
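How aborting a transaction resets state after a failure can be sketched with an undo log. This is a minimal sketch of the idea only: `Store` and `transfer` are hypothetical names, the store is in memory, and the "process failure" is a raised exception.

```python
class Store:
    """A tiny in-memory key-value store with single-transaction undo."""
    def __init__(self):
        self.data = {}
        self._undo = []   # (key, previous value or None) per write

    def write(self, key: str, value: int) -> None:
        self._undo.append((key, self.data.get(key)))  # remember old value
        self.data[key] = value

    def commit(self) -> None:
        self._undo.clear()

    def abort(self) -> None:
        # Undo in reverse order so the oldest saved value wins.
        for key, old in reversed(self._undo):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self._undo.clear()

def transfer(store: Store, amount: int, crash_midway: bool) -> None:
    store.write("a", store.data["a"] - amount)
    if crash_midway:
        raise RuntimeError("process failed in the middle of the transaction")
    store.write("b", store.data["b"] + amount)

store = Store()
store.write("a", 100); store.write("b", 0); store.commit()
try:
    transfer(store, 30, crash_midway=True)
    store.commit()
except RuntimeError:
    store.abort()                            # takeover: reset to last commit
transfer(store, 30, crash_midway=False)      # reprocess the transaction
store.commit()
print(store.data)  # {'a': 70, 'b': 30}
```

Without the abort, the retry would debit account "a" twice. The undo log is what makes it safe to reprocess the transaction.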
PROCESS PAIRS - SUMMARY
Transactions Give Reliability
Process Pairs Give Availability
Process Pairs Are Expensive & Hard To Program
Transactions + Persistent Process Pairs
=> Fault Tolerant Sessions & Execution
When Tandem Converted To This Style
Saved 3x Messages
Saved 5x Message Bytes
Made Programming Easier
SYSTEM PAIRS FOR HIGH AVAILABILITY
Programs, Data, Processes Replicated at two sites.
Pair looks like a single system.
System becomes logical concept
Like Process Pairs: System Pairs.
Backup receives transaction log (spooled if backup down).
If primary fails or operator switches, backup offers service.
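Log shipping with spooling can be sketched in a few lines. This is a minimal sketch, assuming a log record is just a (key, value) pair and that the primary can observe whether the backup is up; the class and method names are hypothetical and do not describe any vendor's product.

```python
class Backup:
    def __init__(self):
        self.up = True
        self.data = {}

    def apply(self, record: tuple[str, int]) -> None:
        key, value = record
        self.data[key] = value

class Primary:
    def __init__(self, backup: Backup):
        self.backup = backup
        self.data = {}
        self.spool = []   # log records not yet shipped

    def commit(self, key: str, value: int) -> None:
        self.data[key] = value
        self.spool.append((key, value))
        self.ship_log()

    def ship_log(self) -> None:
        """Send spooled log records, in order, while the backup is reachable."""
        while self.spool and self.backup.up:
            self.backup.apply(self.spool.pop(0))

backup = Backup()
primary = Primary(backup)
primary.commit("x", 1)
backup.up = False            # backup site goes down; records are spooled
primary.commit("x", 2)
primary.commit("y", 3)
backup.up = True             # backup returns and catches up from the spool
primary.ship_log()
print(backup.data == primary.data)  # True
```

Shipping records in commit order is what lets the backup reach the same state as the primary after an outage.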
SYSTEM PAIR CONFIGURATION OPTIONS
Mutual Backup:
each has 1/2 of Database & Application
Hub:
One site acts as backup for many others
In General can be any directed graph
[Diagrams: Primary and Backup sites connected as mutual backups, as a hub, and as a general directed graph]
Stale replicas: Lazy replication
[Diagram: a set of Copy nodes]
SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
Step 1: Both systems are running V1. (Primary) V1, (Backup) V1
Step 2: Backup is cold-loaded as V2. (Primary) V1, (Backup) V2
Step 3: SWITCH to Backup. (Backup) V1, (Primary) V2
Step 4: Backup is cold-loaded as V2 D30. (Backup) V2, (Primary) V2
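The four steps can be written as a small state machine that checks one site is always primary. This is a minimal sketch; `Site` and `rolling_upgrade` are hypothetical names, and "cold-load" is reduced to assigning a version string.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    version: str
    role: str  # "primary" or "backup"

def rolling_upgrade(a: Site, b: Site, new_version: str) -> list[str]:
    """Upgrade both sites while one of them always offers service."""
    history = []
    def snapshot(step: str) -> None:
        assert sum(s.role == "primary" for s in (a, b)) == 1
        history.append(f"{step}: {a.name}={a.version}/{a.role}, "
                       f"{b.name}={b.version}/{b.role}")
    snapshot("Step 1")                       # both run the old version
    backup = a if a.role == "backup" else b
    primary = b if backup is a else a
    backup.version = new_version             # cold-load the backup
    snapshot("Step 2")
    primary.role, backup.role = "backup", "primary"   # SWITCH
    snapshot("Step 3")
    primary.version = new_version            # old primary, now backup, upgraded
    snapshot("Step 4")
    return history

site_a, site_b = Site("A", "V1", "primary"), Site("B", "V1", "backup")
for line in rolling_upgrade(site_a, site_b, "V2"):
    print(line)
```

The only moment users might notice is the switch in Step 3; all loading happens on whichever site is currently the backup.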
Similar ideas apply to:
Database Reorganization
Hardware modification (e.g. add discs, processors,...)
Hardware maintenance
Environmental changes (rewire, new air conditioning)
Move primary or backup to new location.
SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT: different sites
weather
utilities
sabotage
Protects against OPERATOR FAILURE:
two sites, two sets of operators
Protects against MAINTENANCE OUTAGES
work on backup
software/hardware install/upgrade/move...
Protects against HARDWARE FAILURES
backup takes over
Protects against TRANSIENT SOFTWARE ERRORS
Commercial systems:
Digital's Remote Transaction Router (RTR)
Tandem's Remote Database Facility (RDF)
IBM's Cross Recovery XRF (both in same campus)
Oracle, Sybase, Informix, Microsoft... replication
SUMMARY
FT systems fail for the conventional reasons
Environment   mostly
People        sometimes
Software      mostly
Hardware      Rarely
MTTF of FT SYSTEMS
~ 50X conventional
~ years vs weeks
Fail-Fast Modules + Reconfiguration + Repair =>
Good Hardware Fault Tolerance
Transactions + Process Pairs =>
Good Software Fault Tolerance (Repair)
System Pairs Hide Many Faults
Challenge: Tolerate Human Errors
(make system simpler to manage, operate, and maintain)
References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of
Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE
Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on
Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE
Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo,
Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and
Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology.
15’th FTCS. 2-11.
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc
10’th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.