Dependability in the Internet Era 1

Dependability
in the
Internet Era
1
Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations
2
Preview
The Last 5 Years: Availability Dark Ages
Ready for a Renaissance?
• Things got better, then things got a lot worse!
[Figure: availability (9% to 99.999%) plotted against year (1950 to 2000); chart labels: Cell phones, Internet.]
3
DEPENDABILITY: The 3 ITIES
• RELIABILITY / INTEGRITY:
Does the right thing.
(also MTTF >> 1)
• AVAILABILITY:
Does it now.
(also 1 >> MTTR / (MTTF + MTTR))
[Diagram labels: Integrity, Security, Reliability, Availability]
System Availability:
If 90% of terminals up & 99% of DB up?
(=>89% of transactions are serviced on time).
• Holistic vs. Reductionist view
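The 89% figure follows from multiplying the two availabilities. A minimal sketch, assuming terminals and database fail independently and that a transaction needs both; the function name is illustrative, not from the talk:

```python
from math import prod


def series_availability(availabilities):
    """Availability of a system that needs every listed part up at once.

    Assumes the parts fail independently, so their probabilities multiply.
    """
    return prod(availabilities)


# Figures from the slide: 90% of terminals up, 99% of the database up.
system = series_availability([0.90, 0.99])
print(f"{system:.1%}")  # 89.1%
```

This is the holistic view in one line: the user sees the product of every link in the chain, not the best component.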
4
Fail-Fast is Good, Repair is Needed
Lifecycle of a module
fail-fast gives short fault latency
High Availability is low UN-Availability
Unavailability ~ MTTR / MTTF
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
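The relation between MTTF, MTTR, and unavailability can be sketched as follows. The module figures (1,000 hours MTTF, 1 hour MTTR) are hypothetical and chosen only to show that halving MTTR and doubling MTTF buy the same improvement:

```python
def availability(mttf, mttr):
    """Fraction of time the module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def unavailability(mttf, mttr):
    """Fraction of time the module is down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = unavailability(1000.0, 1.0)
half_mttr = unavailability(1000.0, 0.5)    # repair twice as fast
double_mttf = unavailability(2000.0, 1.0)  # fail half as often

# When MTTF >> MTTR the exact value is close to the slide's MTTR / MTTF.
approx = 1.0 / 1000.0

print(f"base={base:.6f} approx={approx:.6f}")
print(f"half MTTR={half_mttr:.6f} double MTTF={double_mttf:.6f}")
```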
5
Fault Model
• Failures are independent
So, single fault tolerance is a big win
• Hardware fails fast (dead disk, blue-screen)
• Software fails-fast (or goes to sleep)
• Software often repaired by reboot:
– Heisenbugs
• Operations tasks: major source of outage
– Utility operations
– Software upgrades
6
Disks (raid) the BIG Success Story
• Duplex or Parity: masks faults
• Disks @ 1M hours (~100 years)
• But
– controllers fail and
– have 1,000s of disks.
• Duplexing or parity, and dual path gives “perfect
disks”
• Wal-Mart never lost a byte
(thousands of disks, hundreds of failures).
• Only software/operations mistakes are left.
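A rough sketch of why a very reliable disk still fails often in a large installation. It assumes a constant failure rate, and the fleet size of 1,000 disks is illustrative rather than a figure from the talk:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def expected_failures_per_year(n_disks, mttf_hours):
    """Expected disk failures per year, assuming a constant failure rate."""
    return n_disks * HOURS_PER_YEAR / mttf_hours


# The slide's disk MTTF of 1M hours, expressed in years (roughly a century).
mttf_years = 1_000_000 / HOURS_PER_YEAR

# Illustrative fleet of 1,000 such disks.
fleet = expected_failures_per_year(1_000, 1_000_000)

print(f"MTTF is about {mttf_years:.0f} years per disk")
print(f"about {fleet:.2f} failures per year across the fleet")
```

With failures this frequent at fleet scale, masking them through duplexing or parity is what makes the disks look "perfect".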
7
Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: mask local faults
– RAID disks
– Uninterruptible Power Supplies
– Cluster Failover
• Disaster Tolerance: masks site failures
– Protects against fire, flood, sabotage,..
– Redundant system and service
at remote site.
8
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES

Outage source                     Share of outages   MTTF
Vendor (hardware and software)    42%                5 months
Application software              25%                9 months
Communications lines              12%                1.5 years
Operations                        9.3%               2 years
Environment                       11.2%              2 years
All sources                                          10 weeks
To Get 10 Year MTTF, Must Attack All These Areas
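The survey totals can be checked with a short calculation. This is a sketch under two assumptions: the 6/84 - 7/85 reporting window is read as about 13 months, and the percentages are shares of the outage count, so each source's MTTF is the overall MTTF divided by its share:

```python
WEEKS_PER_MONTH = 52 / 12
HOURS_PER_WEEK = 7 * 24

institutions = 1383
outages = 7517
survey_months = 13  # assumption: 6/84 - 7/85 read as about 13 months

# Institution-weeks observed, divided by the outages seen in that time.
mttf_weeks = institutions * survey_months * WEEKS_PER_MONTH / outages

# Average outage lasted about 90 minutes.
mttr_hours = 1.5
mttf_hours = mttf_weeks * HOURS_PER_WEEK
avail = mttf_hours / (mttf_hours + mttr_hours)

# Per-source MTTF, using the slide's rounded 10-week overall figure.
shares = {"vendor": 0.42, "application": 0.25, "comm lines": 0.12,
          "operations": 0.093, "environment": 0.112}
per_source_weeks = {name: 10 / share for name, share in shares.items()}

print(f"overall MTTF is about {mttf_weeks:.1f} weeks")
print(f"availability is about {avail:.3%}")
for name, weeks in per_source_weeks.items():
    print(f"{name}: about {weeks:.0f} weeks")
```

Removing any single source lengthens the overall MTTF only modestly, which is the arithmetic behind attacking all areas at once.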
9
Case Studies - Tandem Trends
• MTTF improved
• Shift from Hardware & Maintenance (from 50% to 10%)
to Software (62%) & Operations (15%)
• NOTE: Systematic under-reporting of
– Environment
– Operations errors
– Application Software
10
Dependability Status circa 1995
• ~4-year MTTF => 5 9s for well-managed sys.
Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
–New Software.
–Utilities.
• Make all hardware/software changes ONLINE.
• Software seems to define a 30-year MTTF ceiling.
• Reasonable Goal: 100-year MTTF.
class 4 today => class 6 tomorrow.
11
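A small sketch of the "class" shorthand, reading class N as the number of leading nines in the availability figure. The slide does not define the term, so that reading is an assumption, and the helper name is illustrative:

```python
import math


def availability_class(a):
    """Number of leading nines in availability `a` (0 <= a < 1).

    Computed as floor(log10(1 / (1 - a))). The small epsilon guards
    against floating-point rounding at exact powers of ten.
    """
    if not 0.0 <= a < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return math.floor(-math.log10(1.0 - a) + 1e-9)


for a in (0.98, 0.99, 0.9999, 0.999999):
    print(a, "-> class", availability_class(a))
```

On this reading, moving from class 4 to class 6 means cutting unavailability by a factor of 100.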
What’s Happened Since Then?
• Hardware got better
• Software got better
(even though it is more complex)
• Raid is standard,
Snapshots coming standard
• Cluster in a box: commodity failover
• Remote replication is standard.
12
Availability
[Figure: availability scale in nines, rising with the level of management]
• Un-managed
• Well-managed nodes
– Masks some hardware failures
• Well-managed packs & clones
– Masks hardware failures, operations tasks (e.g. software upgrades)
– Masks some software failures
• Well-managed GeoPlex
– Masks site failures (power, network, fire, move,…)
– Masks some operations failures
13
Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations
14
Progress?
• MTTF improved from 1950-1995
• MTTR has not improved much
since 1970 failover
• Hardware and Software online change
(PnP) is now standard
• Then the Internet arrived:
– No project can take more than 3 months.
– Time to market is everything
– Change is good.
15
The Internet Changed Expectations
1990
Phones delivered 99.999%
ATMs delivered 99.99%
Failures were
front-page news.
Few hackers
Outages last an “hour”
2000
Cellphones deliver 90%
Web sites deliver 98%
Failures are
business-page news
Many hackers.
Outages last a “day”
This is progress?
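To make the percentages on this slide concrete, the sketch below converts each availability figure into expected downtime per year. It assumes a 365-day year; the labels simply repeat the slide's examples:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(a):
    """Expected minutes of downtime per year at availability `a`."""
    return (1.0 - a) * MINUTES_PER_YEAR


examples = [
    ("1990 phones", 0.99999),
    ("1990 ATMs", 0.9999),
    ("2000 web sites", 0.98),
    ("2000 cellphones", 0.90),
]

for label, a in examples:
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: {minutes:,.1f} min/year ({minutes / 1440:.2f} days)")
```

Five nines is about five minutes a year; 98% is about a week, and 90% is more than a month.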
16
Why (1) Complexity
• Internet sites are MUCH more complex.
– NAP
– Firewall/proxy/ipsprayer
– Web
– DMZ
– App server
– DB server
– Links to other sites
– tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…
• Skill level is much reduced
17
One of the Data Centers (500 servers)
[Figure: Canyon Park Data Center, Microsoft.com Network Diagram. Cisco 7000 routers (HSRP) and Catalyst 5000 and 2926 switches in front of IIS web server farms for the microsoft.com sites (www, download, support, search, register, msdn, windowsmedia, and others), with live, backup, consolidator, and miscellaneous SQL servers, FTP and SMTP servers, build servers, stagers, PPTP / terminal servers, and monitoring servers.]
Drawn by: Matt Groshong
Last updated: April 12, 2000
IP addresses removed by Jim Gray to protect security

Microsoft.com Server Count
FTP                    6
Build Servers         32
IIS                  210
Application            2
Exchange              24
Network/Monitoring    12
SQL                  120
Search                 2
NetShow                3
NNTP                  16
SMTP                   6
Stagers               26
Total                459
18
A Schematic of HotMail
• ~7,000 servers
• 100 backend stores with 120TB (cooked)
• 3 data centers
• Links to
– Passport
– Ad-rotator
– Internet Mail gateways
– …
• ~1B messages per day
• 150M mailboxes, 100M active
• ~400,000 new per day.
[Diagram labels: Internet; Local Directors; Front Door, Graphics, AD, Incoming Mail, and Login servers (MSERVS); Member Directory; USTORES data stores; Switched Ethernet; Telnet Management]
19
Why (2) Velocity
• No project can take more than 13 weeks.
• Time to market is everything
• Functionality is everything
• Faster, cheaper, badder
[Diagram labels: Functionality, Schedule, Quality]
20
Why (3) Hackers
• Hackers are a new, increased threat
• Any site can be attacked from anywhere
• Motives include ego, malice, and greed.
• Complexity makes it hard to protect sites.
• Concentration of wealth makes an attractive target:
– Why did you rob banks?
– Willie Sutton: ’Cause that’s where the money is!
Note: Eric Raymond’s How to Become a Hacker
http://www.tuxedo.org/~esr/faqs/hacker-howto.html
uses the term in its positive sense; here I mean malicious and anti-social hackers.
21
How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.
22
How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
• Median monthly % ping packet loss for 2/99
23
Microsoft.Com
• Operations mis-configured
a router
• Took a day to diagnose
and repair.
• DOS attacks cost a
fraction of a day.
• Regular security
patches.
24
BackEnd Servers are More Stable
• Generally deliver 99.99%
• TerraServer for example: single back-end failed after 2.5 y.
• Went to 4-node cluster
• Fails every 2 mo. Transparent failover in 30 sec. Online software upgrades
• So… 99.999% in backend…

Year 1                    Time           %
Total Up Time             8754:07:22     99.93%
Total Down Time           5:52:38        0.07%
Total Time                8760:00:00     100.00%
Scheduled Down            2:50:45
Un-Scheduled Down         3:01:53
Scheduled Availability    8757:09:15     99.97%

Through 18 Months         Time           %
Up Time                   12888:21:49    99.519%
Scheduled Down            4:00:25        0.031%
Unscheduled Down          58:20:46       0.451%
Total Time                12950:43:00    99.52% (up)
Total Down                62:21:11       0.48%

Down 30 hours in July (hardware stop, auto restart failed, operations failure)
Down 26 hours in September (Backplane failure, I/O Bus failure)
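The TerraServer percentages can be reproduced from the reported H:MM:SS durations. A minimal sketch; the helper names are illustrative and the durations are the ones reported on the slide:

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def share(part, total):
    """Fraction of `total` taken up by `part`, both as 'H:MM:SS' strings."""
    return to_seconds(part) / to_seconds(total)


# Year 1: up time against a full 8,760-hour year.
year1_up = share("8754:07:22", "8760:00:00")

# Through 18 months: up, scheduled down, and unscheduled down.
total_18 = "12950:43:00"
up_18 = share("12888:21:49", total_18)
sched_18 = share("4:00:25", total_18)
unsched_18 = share("58:20:46", total_18)

print(f"year 1 up: {year1_up:.2%}")
print(f"18 months: up {up_18:.3%}, scheduled down {sched_18:.3%}, "
      f"unscheduled down {unsched_18:.3%}")
```

The three 18-month components sum exactly to the reported total time, and nearly all of the downtime is unscheduled, dominated by the two long outages in July and September.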
25
eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml#top
• Publishes operations log.
• Has 99% of scheduled uptime
• Schedules about 2 hours/week down.
• Has had some operations outages
• Has had some DOS problems.
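A rough sketch of what these figures mean for a user who does not care whether downtime was scheduled. It reads "99% of scheduled uptime" as 99% availability during the hours the site is meant to be up, which is an interpretation of the slide:

```python
HOURS_PER_WEEK = 7 * 24  # 168

scheduled_down_hours = 2.0     # about 2 hours/week of planned downtime
scheduled_availability = 0.99  # assumed: availability within scheduled hours

# Fraction of the week the site is meant to be up.
scheduled_fraction = (HOURS_PER_WEEK - scheduled_down_hours) / HOURS_PER_WEEK

# Availability as seen around the clock.
overall = scheduled_fraction * scheduled_availability

print(f"scheduled fraction: {scheduled_fraction:.2%}")
print(f"overall availability: {overall:.2%}")
```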
26
Outline
• The glorious past (Availability Progress)
• The dark ages (current scene)
• Some recommendations
27
Not to throw stones but…
• Everyone has a serious problem.
• The BEST people publish their stats.
• The others HIDE their stats
(check Netcraft to see who I mean).
• We have good NODE-level availability
5-9s is reasonable.
• We have TERRIBLE system-level availability
2-9s is the goal.
28
Recommendation #1
• Continue progress on back-ends.
– Make management easier
(AUTOMATE IT!!!)
– Measure
– Compare best practices
– Continue to look for better algorithms.
• Live in fear
– We are at 10,000 node servers
– We are headed for 1,000,000 node servers
29
Recommendation #2
• Current security approach is unworkable:
– Anonymous clients
– Firewall is clueless
– Incredible complexity
• We can't win this game!
• So change the rules (redefine the problem):
– No anonymity
– Unified authentication/authorization model
– Single-function devices (with simple interfaces)
– Only one kind of interface (uddi/wsdl/soap/…).
30
References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N. and A. Reuter. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems. Pisa, September 1991. 177-186.
Long, D., A. Muir, and R. Golding. (1995). A Longitudinal Study of Internet Host Reliability. Proceedings of the Symposium on Reliable Distributed Systems. Bad Neuenahr, Germany: IEEE, September 1995. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report, dated, but the others have died off!
31