Dependability in the Internet Era

Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations

Preview
The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?
• Things got better, then things got a lot worse!
[Chart: availability (9% to 99.999%) by year (1950 to 2000); labelled curves include Cell phones and Internet.]

DEPENDABILITY: The 3 ITIES
• RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
• AVAILABILITY: Does it now. (also 1 >> MTTR / (MTTF + MTTR))
[Figure labels: Integrity, Security, Reliability, Availability]
System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time).
• Holistic vs. Reductionist view

Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short fault latency.
High Availability is low UN-Availability.
Unavailability ~ MTTR / MTTF
Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.

Fault Model
• Failures are independent. So, single fault tolerance is a big win
• Hardware fails fast (dead disk, blue-screen)
• Software fails-fast (or goes to sleep)
• Software often repaired by reboot:
– Heisenbugs
• Operations tasks: major source of outage
– Utility operations
– Software upgrades

Disks (RAID) the BIG Success Story
• Duplex or Parity: masks faults
• Disks @ 1M hours MTTF (~100 years)
• But
– controllers fail and
– have 1,000s of disks.
• Duplexing or parity, and dual path gives "perfect disks"
• Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
• Only software/operations mistakes are left.

Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: masks local faults
– RAID disks
– Uninterruptible Power Supplies
– Cluster Failover
• Disaster Tolerance: masks site failures
– Protects against fire, flood, sabotage, ...
– Redundant system and service at remote site.

Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
Source of outage                  Share of outages   MTTF
Vendor (hardware and software)    42%                5 months
Application software              25%                9 months
Communications lines              12%                1.5 years
Operations                        9.3%               2 years
Environment                       11.2%              2 years
All sources                                          10 weeks
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES.
To Get 10 Year MTTF, Must Attack All These Areas.

Case Studies - Tandem Trends
MTTF improved.
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%).
NOTE: Systematic under-reporting of Environment, Operations errors, Application Software.

Dependability Status circa 1995
• ~4-year MTTF => 5 9s for well-managed sys. Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
– New Software.
– Utilities.
• Make all hardware/software changes ONLINE.
• Software seems to define a 30-year MTTF ceiling.
• Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

What's Happened Since Then?
• Hardware got better
• Software got better (even though it is more complex)
• RAID is standard, Snapshots coming standard
• Cluster in a box: commodity failover
• Remote replication is standard.

Availability
[Figure: availability scale, labelled "99999", with these levels:]
Un-managed
well-managed nodes: Masks some hardware failures
well-managed packs & clones: Masks hardware failures, Operations tasks (e.g. software upgrades). Masks some software failures
well-managed GeoPlex: Masks site failures (power, network, fire, move, …). Masks some operations failures

Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations

Progress?
• MTTF improved from 1950-1995
• MTTR has not improved much since failover arrived (1970)
• Hardware and Software online change (PnP) is now standard
• Then the Internet arrived:
– No project can take more than 3 months.
– Time to market is everything
– Change is good.
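
The MTTF, MTTR and "nines" figures quoted in the slides so far can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original talk: the function names are made up for illustration, failures are assumed independent (as in the Fault Model slide), and "class" is taken to mean the number of leading nines in the availability.

```python
import math

HOURS_PER_WEEK = 7 * 24
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def serial_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up (independent failures)."""
    return math.prod(parts)


def availability_class(a: float) -> int:
    """Number of leading nines, i.e. floor(log10(1 / (1 - a)))."""
    # Round first so that a value such as 0.99999 is not pushed down to 4 by float error.
    return math.floor(round(-math.log10(1.0 - a), 9))


def downtime_minutes_per_year(a: float) -> float:
    """Expected downtime in a 365-day year at availability a."""
    return (1.0 - a) * MINUTES_PER_YEAR


# Japan survey figures from the slide: MTTF ~ 10 weeks, average outage ~ 90 minutes.
japan = availability(10 * HOURS_PER_WEEK, 1.5)

# Terminals 90% up and database 99% up; a transaction needs both.
end_to_end = serial_availability(0.90, 0.99)

if __name__ == "__main__":
    print(f"Japan survey: {japan:.4%} available, class {availability_class(japan)}")
    print(f"Terminals x DB: {end_to_end:.1%} of transactions serviced")
    print(f"Five nines allows {downtime_minutes_per_year(0.99999):.2f} minutes down per year")
```

Under these assumptions the Japan survey numbers work out to roughly three nines, and the terminals-times-database product reproduces the 89% figure from the "3 ITIES" slide. The same functions make the MTTR point concrete: at a fixed MTTF, halving MTTR halves unavailability.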

The Internet Changed Expectations
1990
• Phones delivered 99.999%
• ATMs delivered 99.99%
• Failures were front-page news.
• Few hackers
• Outages last an "hour"
2000
• Cellphones deliver 90%
• Web sites deliver 98%
• Failures are business-page news
• Many hackers.
• Outages last a "day"
This is progress?

Why (1) Complexity
• Internet sites are MUCH more complex.
– NAP
– Firewall/proxy/ipsprayer
– Web
– DMZ
– App server
– DB server
– Links to other sites
– tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…
• Skill level is much reduced

One of the Data Centers (500 servers)
[Figure: Canyon Park Data Center, Microsoft.com Network Diagram. Drawn by: Matt Groshong. Last updated: April 12, 2000. IP addresses removed by Jim Gray to protect security. Panels: Cisco 7000 routers (HSRP) and Catalyst 5000 / Catalyst 2926 switches in front of IIS server groups for the microsoft.com sites (WWW, DOWNLOAD, SUPPORT, SEARCH, MSDN, REGISTER, WINDOWSMEDIA, FTP and others); Microsoft.com Stagers, Build and Misc. Servers (Build Servers, Stagers, PPTP / Terminal Servers, Monitoring Servers); Microsoft.com SQL Servers (Live, Backup, Consolidator, Misc.).]

Microsoft.com Server Count
FTP                   6
Build Servers        32
IIS                 210
Application           2
Exchange             24
Network/Monitoring   12
SQL                 120
Search                2
NetShow               3
NNTP                 16
SMTP                  6
Stagers              26
Total               459

A Schematic of HotMail
• ~7,000 servers
• 100 backend stores with 120TB (cooked)
• 3 data centers
• Links to
– Internet Mail gateways
– Ad-rotator
– Passport
– …
• ~ 1B messages per day
• 150M mailboxes, 100M active
• ~400,000 new per day.
[Figure: HotMail schematic. Labelled components: Internet, Local Directors, Front Doors, Graphics Servers, AD Servers, Incoming Mail Servers, Login Servers (each a group of MSERVS), Switched Ethernet, USTORES (Data), Member Directory, Telnet Management.]

Why (2) Velocity
• No project can take more than 13 weeks.
• Time to market is everything
• Functionality is everything
• Faster, cheaper, badder
[Figure labels: Functionality, Schedule, Quality]

Why (3) Hackers
• Hackers are a new, increased threat
• Any site can be attacked from anywhere
• Motives include ego, malice, and greed.
• Complexity makes it hard to protect sites.
• Concentration of wealth makes an attractive target:
  • Why did you rob banks?
  • Willie Sutton: 'Cause that's where the money is!
Note: Eric Raymond's How to Become a Hacker http://www.tuxedo.org/~esr/faqs/hacker-howto.html is the positive use of the term; here I mean malicious and anti-social hackers.

How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.

How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
• Median monthly % ping packet loss for 2/99

Microsoft.Com
• Operations mis-configured a router
• Took a day to diagnose and repair.
• DOS attacks cost a fraction of a day.
• Regular security patches.

BackEnd Servers are More Stable
• Generally deliver 99.99%
• TerraServer for example: single back-end failed after 2.5 y.
• Went to 4-node cluster
• Fails every 2 mo. Transparent failover in 30 sec. Online software upgrades
So… 99.999% in backend…

Year 1                      Time          %
Total Up Time               8754:07:22    99.93%
Total Down Time                5:52:38     0.07%
Total Time                  8760:00:00   100.00%
Scheduled Down                 2:50:45
Un-Scheduled Down              3:01:53
Scheduled Availability      8757:09:15    99.97%

Through 18 Months           Time           %
Up Time                     12888:21:49   99.519%
Scheduled Down                  4:00:25    0.031%
Unscheduled Down               58:20:46    0.451%
Total Time                  12950:43:00   99.52% (up)
Total Down                     62:21:11    0.48%
Down 30 hours in July (hardware stop, auto restart failed, operations failure)
Down 26 hours in September (Backplane failure, I/O Bus failure)

eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml#top
• Publishes operations log.
• Has 99% of scheduled uptime
• Schedules about 2 hours/week down.
• Has had some operations outages
• Has had some DOS problems.
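
The TerraServer and eBay percentages above follow directly from the published durations. The sketch below recomputes them; it is illustrative only. The helper names are hypothetical, and the eBay line assumes that "99% of scheduled uptime" means 99% of the hours left after about 2 scheduled hours of downtime per week, which the slide does not spell out.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent(part: str, whole: str) -> float:
    """Share of `whole` taken by `part`, both given as 'H:MM:SS'."""
    return 100.0 * to_seconds(part) / to_seconds(whole)


# TerraServer, year 1: up time over total time.
year1_up = percent("8754:07:22", "8760:00:00")

# TerraServer, through 18 months.
months18_up = percent("12888:21:49", "12950:43:00")
months18_unscheduled = percent("58:20:46", "12950:43:00")

# eBay (assumed reading): 99% up during the hours not reserved for maintenance.
HOURS_PER_WEEK = 7 * 24
ebay_overall = 0.99 * (HOURS_PER_WEEK - 2) / HOURS_PER_WEEK

if __name__ == "__main__":
    print(f"TerraServer year 1 up: {year1_up:.2f}%")
    print(f"TerraServer 18 months up: {months18_up:.3f}%")
    print(f"TerraServer 18 months unscheduled down: {months18_unscheduled:.3f}%")
    print(f"eBay overall availability (assumed reading): {ebay_overall:.2%}")
```

On that reading, scheduled maintenance alone costs eBay a little over 1% of the week, so its overall availability lands just under 98%, in line with the "Web sites deliver 98%" figure earlier in the talk.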

Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations

Not to throw stones but…
• Everyone has a serious problem.
• The BEST people publish their stats.
• The others HIDE their stats (check Netcraft to see who I mean).
• We have good NODE-level availability: 5-9s is reasonable.
• We have TERRIBLE system-level availability: 2-9s is the goal.

Recommendation #1
• Continue progress on back-ends.
– Make management easier (AUTOMATE IT!!!)
– Measure
– Compare best practices
– Continue to look for better algorithms.
• Live in fear
– We are at 10,000 node servers
– We are headed for 1,000,000 node servers

Recommendation #2
• Current security approach is unworkable:
– Anonymous clients
– Firewall is clueless
– Incredible complexity
• We can't win this game!
• So change the rules (redefine the problem):
– No anonymity
– Unified authentication/authorization model
– Single-function devices (with simple interfaces)
– Only one kind of interface (uddi/wsdl/soap/…).

References
Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N. and A. Reuter. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991. 177-186.
Long, D., A. Muir, and R. Golding. (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report, dated, but the others have died off!