Digest of Fast Abstracts, International Conference on Dependable Systems and Networks, New York, June, 2000 Dependabilty Analysis for E-Commerce Site Business Decisions Myron Hecht, Yutao He, Xuegao An SoHaR Incorporated, Beverly Hills, CA {myron, yutao, xuegao}@sohar.com 1. Introduction Application of standard dependability analysis techniques to e-commerce sites can provide significant non-obvious answers to business questions related to ecommerce system architecture. This paper provides an example. allows a model to be composed as a hierarchy of Reliability Block and Markov Models representing each of the major elements. Hierarchical models allow simpler, easily verified models to be combined to represent a complex system, and represents the most important benefit of hierarchical modeling. 2. Example System Top Level Model (Block Dgm) For an example system, we use a simplified representation of the eBay e-commerce web site shown in Figure 1 [1]. Uunet and Sprint Internet backbones are connected to redundant Cisco routers and then a set of (Compaq) frontend web servers running on Windows NT. The web servers in turn pass requests on auctions and/or bids to redundant Oracle Database Management System (DBMS) servers hosted on Sun Starfire servers These DBMSs in turn retrieve and update tables representing each of the auctions. The outcomes of the auctions are recorded in the Anaconda system that then participants by email. Buyer's ISP Backbone Model (Markov) Gateway Model (Markov) Front end Server (Markov) Server (Block Dgm) DBMS Server Block Dgm Hardware (Markov) Solaris_OS (Markov) Software (Markov) Figure 2. Organization of Hierarchical Model Figure 3 shows an example of one of the submodels for the front-end web server farm. Loss of a single server results in a loss of capacity and is shown by the reward values in the state representations. Di agr am Fr o n t_ E n d_ Ser ver s In i ti al State: S0 Fai lu r e State: S7 Seller's ISP Seller Submits Item Buyer Submits Bid (N r s-1)*ë r s ìr s Sprint Backbone Backbone S1 UUnet Backbone N r s*ë r s 0 .9 5 (N r s- 2 )*ë r s (N r s-3 )*ë r s S2 ìr s 0 .9 S3 ìr s 0 .85 S4 ìr s 0 .8 Cisco Switches Cisco Mirrored Switches/Routers S0 Cisco Mirrored Switches/Routers (N r s- 4 )*ë r s ìr s 1 (N r s-6)*ë r s Front End Servers ... Compaq Front End Server Compaq Compaq Front End Server Front End Server S7 Shared RAID DBMS Servers Sun Starfire/DBMS "Bull" Shared RAID 0 Compaq Front End Server Sun Starfire/DBMS "Bearl" System Boundary (N r s-5)*ë r s S6 ìr s 0 .7 S5 ìr s 0 .75 P C _ ser ver ërs ìr s Figure 3. Front End Processor Model Shared RAID Sun Starfire/DBMS "Anaconda" Shared RAID Figure 1 eBay System Configuration 3. Model Description The system model was created using the MEADEP dependability analysis tool developed by SoHaR [3] Tool ©2000, SoHaR Incorporated Imperfect recovery and finite recovery times in activestandby systems such as database servers were modeled using coverage parameters Details of these models are described in [2] and [3]. 4. Modeling Results The results of the model using representative (but not actual eBay site) parameters include an overall system availability of 0.9957. The site is predicted to be off-line Digest of Fast Abstracts, International Conference on Dependable Systems and Networks, New York, June, 2000 every 61 hours with an annual downtime of about 38 hours. With 650 auction bids per minute (as of October, 1999 [1]), this downtime could amount to more than 1.46 million lost transactions. Reducing downtime has a significant financial impact. However, not all systems contribute equally. Figure 4 shows the failure rates by subsystem and shows that the highest failure rate is the front-end server PC subsystem. Although a natual conclusion is that this subsystem should have the highest priority for being addressed, the WAN backbone subsystem actually had the highest impact on site availability. The management benefit of such modeling becomes apparent when downtime is translated into lost transactions such is as shown in Figure 6. With a restoration time of 30 minutes, the expected number of lost transactions per year is approximately 1.2 million. However, as the restoration time approaches 5 hours, the number of lost transactions increases to 2.8 million. If the value of each transaction is $0.10, the value of a service level agreement that guarantees a restoration time of 0.5 hours or less is more than $160,000 per year. 15,000 10,000 5,000 0 50 100 150 200 250 300 350 WAN Service Provider Backbone Restoration Time (Minutes) Figure 6. Impact of Restoration Time Subsystem Figure 4. Results by Subsystem This result was found using parametric analysis and demonstrates its importance in defining an overall availability strategy. The Model Evaluator (ME) module of MEADEP facilitates such analyses for design and operations tradeoffs. Figure 5 shows the annual downtime as a function of the service provider as the repair rate is being varied from 0.5 per hour (corresponding to a repair time of 2 hours) to a repair rate of 2 per hour (corresponding to a repair time of 0.5 hours). The ordinate shows the expected downtime decreases from approximately 38 hours to approximately 30. This result may be quite significant when negotiating a service level agreement. 38 37 Yearly-Downtime (hours) 20,000 0 DB Servers Front End Servers Cisco Switches Backbone 0.008 0.007 0.006 Failure 0.005 Rate (per 0.004 hour) 0.003 0.002 0.001 0 Monthly Revenue Loss ($) 25,000 36 35 34 The model can be used to evaluate the value the additional capacity. Increasing the capacity of both backbone WANs so that either could handle the full load (as opposed to either handling half the load) could decrease the monthly revenue loss by as much as $30,000 or as little as $9,000 depending on the reliability of the service providers. The lower the reliability (MTBO), the greater the benefit of increasing the capacity. With these results, it is also possible to assess the value of a service level agreement with the additional dimension of capacity. For example, it may be that provisioning each of the backbones with double the capacity is a lower cost option than securing service level agreements with both WAN backbone service providers. With the MEADEP reward function, it is also possible to consider other options such as providing full capacity in the first backbone and half capacity in the second one, or providing 75% capacity in each. The optimum configuration depends on the traffic profile and the MTBO and restoration times of the WAN backbones, the load profile of the server, the value of the traffic, and other system-specific factors. 33 References 32 [1] Joseph Menn, “Prevention of Online Crashes Is No Easy Fix”, Los Angeles Times, October 16, 1999, Section C, Page 1 31 30 0.5 0.8 1.1 1.4 1.7 2 Service Provider Repair Rate (per hour) Figure 5. Yearly System Downtime as a Function of Service Provider Repair Rates [2] D. Tang, et. al., Experience in Using MEADEP, Proc. 1999 Reliability & Maint. Symp. Wash., DC, 1999 [3] MEADEP description, www.sohar.com/meadep, 2000