Past High Availability Standards Efforts

Jim Gray
Microsoft
http://Research.Microsoft.com/~Gray/Talks/
MTTR is the key metric
• Availability = MTBF/(MTBF+MTTR)
• UN-availability = 1 - Availability ≈ MTTR/MTBF (when MTBF is much larger than MTTR)
• How to make things twice as good:
1. Double MTBF, or
2. Cut MTTR in half
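A minimal sketch of why the two options are equivalent, taking availability as MTBF/(MTBF+MTTR). The function names and the sample values (1000 hours between failures, 1 hour to repair) are illustrative assumptions, not figures from the talk.

```python
def availability(mtbf, mttr):
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


def unavailability(mtbf, mttr):
    """Fraction of time the system is down: MTTR / (MTBF + MTTR)."""
    return mttr / (mtbf + mttr)


# Hypothetical baseline: 1000 hours between failures, 1 hour to repair.
base = unavailability(1000.0, 1.0)

# Option 1: double MTBF. Option 2: cut MTTR in half.
double_mtbf = unavailability(2000.0, 1.0)
half_mttr = unavailability(1000.0, 0.5)

print("baseline unavailability:", base)
print("double MTBF            :", double_mtbf)
print("half MTTR              :", half_mttr)
print("improvement factor     :", base / double_mtbf)
```

Both options give the same unavailability, and the improvement factor is just under 2 because the approximation MTTR/MTBF drops the MTTR term in the denominator.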
The NASDAQ Benchmark
• 1994-1995: Unisys, IBM, DEC, Tandem,…
• Wanted 5 9’s (99.999% availability)
• Two data centers (hot standby)
• Inject faults at a node and measure repair time (cpu, disk, controller, NIC, OS, DB, …):
required 1-10 second repair
• Fail site, watch standby take over:
required 1 minute repair (transparent to client)
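To put the repair-time requirements in context, the sketch below computes the yearly downtime budget implied by five 9's and how many repairs of a given length fit inside it. The one-year measurement window, the 365.25-day year, and the function names are assumptions; the slides do not state how the benchmark accounted for downtime.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed measurement window: one year


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability of N nines."""
    unavailable_fraction = 10.0 ** (-nines)
    return unavailable_fraction * MINUTES_PER_YEAR


def outages_within_budget(nines, repair_seconds):
    """How many outages of the given repair time fit in the yearly budget."""
    budget_seconds = downtime_budget_minutes(nines) * 60
    return int(budget_seconds // repair_seconds)


budget = downtime_budget_minutes(5)
print("five 9's budget (minutes/year):", round(budget, 2))
print("1-minute site failovers per year:", outages_within_budget(5, 60))
print("10-second node repairs per year:", outages_within_budget(5, 10))
```

Under these assumptions the budget is a little over five minutes a year, so only a handful of one-minute site failovers are affordable, which is why the repair times had to be so short.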
RAID Advisory Board
http://www.raid-advisory.com/
• Failure Resistant Disk System (FRDS)
– Repair
• Failure Tolerant Disk System (FTDS)
– Mask
• Disaster Tolerant Disk System (DTDS)
– 1 km separation
• Array Controller (FRAC, FTAC, DTAC)
– NO SPECIFIC TIME OR COST METRICS
EDAP Classification Criteria for Disk Systems
1) Protection Against Data Loss And Loss of Access To Data Due To Disk Failure.
2) Reconstruction Of Failed Disk Contents To A Replacement Disk.
3) Protection Against Data Loss Due To A "Write Hole".
4) Protection Against Data Loss Due To Attached Equipment Failures.
5) Protection Against Data Loss Due To Component Failure.
6) FRU Monitoring And Failure Indication.
7) Disk Hot Swap.
8) Protection Against Data Loss Due To Cache Component Failure.
9) Protection Against Data Loss Due To External Power Failure.
10) Protection Against Data Loss Due To A Temperature Out Of Operating Range Condition.
11) Component And Environmental Failure Warning.
12) Protection Against Loss Of Access To Data Due To Component Failure, Excluding Cache.
13) Protection Against Loss Of Access To Data Due To Cache Component Failure.
14) Protection Against Loss Of Access To Data Due To Attached Equipment Failures.
15) Protection Against Loss Of Access To Data Due To External Power Failure.
16) Protection Against Loss Of Data Access Due To FRU Replacement.
17) Disk Hot Spare.
18) Protection Against Data Loss And Loss Of Access To Data Due To Multiple Disk Failures In An FTDS+.
19) Protection Against Loss Of Data Access Due To Zone Failure.
20) Long Distance Protection Against Loss Of Data Access Due To Zone Failure.
TPC effort (1997)
• Started by John Kemeny & Dean Brock of Data General
• Idea: start with a TPC-C system (benchmark it).
• Then bullet-proof it (RAID, power, backup/restore, geoplex)
• Then re-measure performance & price-performance
– Backup time and impact on online load
– Recovery time for various events (Fail disks, cpus, ctlrs, nics)
– Fail site (disaster recovery with symmetric/asymmetric standby)
• Online Change
– Upgrade software
– Add/replace hardware (cpu, memory, nic, ctlr, disk, tape)
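The "re-measure price-performance" step can be sketched as a before/after comparison of cost per unit of throughput (TPC-C reports price/performance as dollars per tpmC). All input numbers below are hypothetical placeholders, not results from the TPC effort, and the function names are illustrative.

```python
def price_performance(total_cost_dollars, tpmc):
    """TPC-C style price/performance: dollars per tpmC."""
    return total_cost_dollars / tpmc


def hardening_overhead(base_cost, base_tpmc, hard_cost, hard_tpmc):
    """Compare a baseline system with its bullet-proofed version."""
    base = price_performance(base_cost, base_tpmc)
    hard = price_performance(hard_cost, hard_tpmc)
    return {"baseline": base, "hardened": hard, "ratio": hard / base}


# Hypothetical placeholder inputs: hardening adds cost and costs some throughput.
result = hardening_overhead(
    base_cost=1_000_000, base_tpmc=10_000,
    hard_cost=1_500_000, hard_tpmc=9_000,
)
print("baseline $/tpmC:", result["baseline"])
print("hardened $/tpmC:", round(result["hardened"], 2))
print("cost ratio     :", round(result["ratio"], 2))
```

The ratio is one way to express the price of availability; recovery and backup times would be reported alongside it rather than folded into a single number.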