The E-Risks of E-Commerce
Professor Ken Birman
Dept. of Computer Science, Cornell University

Reliability
If it stops ticking when it takes a licking… your e-commerce company could tank.
So you need to know that your technology base is reliable: it does what it should do, does it when needed, does it correctly, and is accessible to your customers.

A Quiz
Q: When and why did Sun Microsystems have a losing quarter?
A (e-mail reply to Ken Birman): "Mr. Birman, Sun experienced a loss in Q4 FY89 (June 1989). This was the quarter in which we transitioned to new manufacturing, order-processing, and inventory-control systems."
-- Andrew Casey, Manager, Investor Relations, Sun Microsystems, Inc., (650) 336-0761, andrew.casey@corp.sun.com

Typical Web Session
[Diagram: a browser issues "GET http://www.cs.cornell.edu/People/ken"; the request passes through a firewall on its way to the server.]

Typical Web Session (continued)
[Diagram: resolving "www.cs.cornell.edu" walks a hierarchy of DNS nodes (roots, intermediate nodes, leaves, plus caches) and yields an IP address such as 128.64.31.77; the request then travels through the firewall and a caching proxy to a load-balancing proxy, which hands it to one of several web servers.]

The Web's Dark Side
Netscape error: "Web server www.cs.cornell.edu ... not responding. Server may have crashed or is overloaded. [OK]"
Right URL, but the request times out. Why?
- The web server could be down
- Your network connection may have failed
- There could be a problem in the DNS
- There could be a network routing problem
- The Internet may be experiencing an overload
- Your web caching proxy may be down
- Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN
- The URL itself may be wrong
- A router or network link may have failed, and the Internet may not yet have rerouted around the problem
(A short sketch of these failure points appears below, after the E-Trade story.)

E-Trade computers crash again -- and again
Edupage Editors <edupage@franklin.oit.unc.edu>, Sun, 07 Feb 1999 10:28:30 -0500
The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented: "…it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others…"
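To make these failure points concrete, here is a minimal sketch (in Python; an illustration added here, not part of the original slides) of what a browser or script does when it fetches the page above: first resolve the name through DNS, then send the request and wait for the reply, with a timeout. Each step can fail independently, which is why one "not responding" dialog has so many possible causes.

```python
import socket
import urllib.request

URL = "http://www.cs.cornell.edu/People/ken"   # the URL from the slide
HOST = "www.cs.cornell.edu"

try:
    # Step 1: DNS resolution -- walks the hierarchy of DNS nodes (roots,
    # intermediate nodes, leaves, caches).  Fails if DNS is unreachable
    # or the name is wrong.
    ip_address = socket.gethostbyname(HOST)
    print("resolved", HOST, "to", ip_address)

    # Step 2: the HTTP request itself -- it may pass through a firewall,
    # a caching proxy, and a load-balancing proxy before reaching one of
    # the web servers.  It fails if any hop, or the server, is down,
    # overloaded, or simply too slow.
    with urllib.request.urlopen(URL, timeout=5) as reply:
        print("got", len(reply.read()), "bytes")

except socket.gaierror as err:
    print("DNS problem:", err)                  # e.g. a DNS node is unreachable
except OSError as err:
    print("network or server problem:", err)    # routing, overload, crash, timeout
```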
Reliable Distributed Computing: Increasingly Urgent, Yet Unsolved
Distributed computing has swept the world:
- Its impact has become revolutionary
- A vast wave of applications is migrating to networks
- It is already as critical a national infrastructure as water, electricity, or telephones
Yet distributed systems remain:
- Unreliable, prone to inexplicable outages
- Insecure, easily attacked
- Difficult (and costly) to program, and bug-prone

A National Imperative
- The potential for catastrophe has been cited by the Presidential Commission on Critical Infrastructure Protection (PCCIP) and the National Academy of Sciences study on Trust in Cyberspace
- These experts warn that we need a quantum improvement in technologies
- Meanwhile, your e-commerce venture is at grave risk of stumbling -- just like many others

A Business Imperative: E-Projects Often Fail
- E-commerce revolves around computing
- Even business and marketing people are at the mercy of these systems
- When your company's computing systems aren't running, you're out of business

Big and Little Pictures
- It is too easy to understand "reliability" as a narrow technical issue
- In fact, many systems and companies stumble by building unreliable technologies, because of a mixture of poor management and poor technical judgment
- Reliable systems demand a balance between good management and good technology

A Glimpse of Some "Unreliable Systems"
- A quick review of some failed projects
- These were characterized by poor reliability of the final product
- But the issues were not really technical
- As future managers, you need to understand this phenomenon!

Tales from the Software Crypt
(Source: Jerry Saltzer, keynote address, SOSP 1999)

NYC control of 10,000 traffic lights
- Univac, based on experience in Baltimore and Toronto
- Started in the late 1960s; scrapped 2-3 years later; amount spent: unknown
- Second-system effect: new radio control system, new software and algorithms
- Earlier systems were 100x smaller -- incommensurate scaling

California Dept. of Motor Vehicles
- Vehicle registration and drivers' licenses
- Started in 1987; scrapped in 1994; spent $44M
- Underestimated the cost by a factor of 3
- Slower than the 1965 system it was meant to replace
- The governor fired the whistleblower
- DMV blames Tandem; Tandem blames DMV

United Airlines / UNIVAC
- Automated reservations, ticketing, flight schedules, fuel delivery, kitchens, and general administration
- Started in the late 1960s; scrapped in the early 1970s; spent $50M
- Second-system effect: tried to automate everything, including the kitchen sink
- Ditto: Burroughs/TWA

CONFIRM (Hilton, Marriott, Budget, American Airlines)
- Hotel reservations, with links to Wizard and Sabre
- Started in 1988; scrapped in 1992; spent $125M
- A second system, built with very dull tools (machine language); bad-news diode
- Delta is currently planning to build something similar, but they will use the web -- the "magic bullet" concept… (today the task is done over the web, and it works well)
- See CACM, October 1994, for details

Tales from the Software Crypt (continued)

SACSS (California)
- State-wide system for automated child-support tracking
- Started in 1991 (budgeted at $99M); put "on hold" in 1997; spent $300M
- Lockheed and HWDC disagree on what the system contains and which part of it isn't working

Taurus (British Stock Exchange)
- Replacement settlement system for the British Stock Exchange
- Started in the 1980s; scrapped in 1993; spent $600M
- "Massive complexity of the back-end settlement systems…"; delays and cost overruns

IBM Workplace OS for the PC
- Mach 3.0, plus binary compatibility with Pink, AIX, DOS, and OS/400
- New clock management, new RPC, new I/O, and new CPU support
- Started in 1991; scrapped in 1996; spent $2B
- 400 staff on the kernel, 1,500 elsewhere
- "Sheer complexity of the class structure proved to be overwhelming"

Advanced Automation System (AAS)
(Source: Jerry Saltzer, keynote address, SOSP 1999)
- Replacement for the "en route" air traffic control system
- Started in 1982; scrapped in 1994; spent more than $6B
- Management misestimated the size and length of the project
- Project goals constantly changed
- "Departments shouldn't deploy a system to additional users if it is not working"
- Even the question of how to represent numbers wasn't settled
- Early design choices and compatibility decisions doomed the project
- Poor technology choices
- Run by government bureaucrats

1995 Standish Group Study
(Source: Jerry Saltzer, keynote address, SOSP 1999)
- "Success" (20%): on time, on budget, with the planned functions
- "Challenged" (50%): over budget, missed schedule, lacking functions -- typically 2x the budget, 2x the completion time, 2/3 of the planned functionality
- "Impaired" (30%): scrapped

A Strange Picture
- Many technology projects fail, for lots of reasons
- But some succeed
- Today we do web-based hotel reservations all the time, yet CONFIRM failed
- The French air traffic project was a success, yet the US project lost $6 billion
- Is there a pattern?

Recurring Problems
(Source: Jerry Saltzer, keynote address, SOSP 1999)
- Incommensurate scaling
- Too many ideas
- The mythical man-month
- Bad ideas included
- Modularity is hard
- The bad-news diode
- The best people are far more productive than average employees
- "New is better; not-even-available-yet is best"
- The magic bullet syndrome

1995 Study of Tandem Computer Systems
(Source: Jerry Saltzer, keynote address, SOSP 1999)
- 77% of failures are software problems
- Software fault-tolerance techniques can overcome about 75% of detected faults
- Loose coupling between primary and backup is important for software fault tolerance
- Over two-thirds (72%) of measured software failures are recurrences of previously reported faults

A Buggy Aside
Q: What are the two main categories of software bugs called?
A: Bohrbugs and Heisenbugs.
Q: Why?

Bohr Model of the Atom
- Bohr argued that the nucleus was a little ball
- A Bohrbug is a nasty but well-defined thing
- Your technical people can reproduce it, so they can nail it

Heisenbug
- Heisenberg modeled the atom as a cloud of electrons around a cloud-like nucleus
- The closer you look, the more it wiggles
- A Heisenbug moves when your people try to pin it down; they won't find it easy to fix

Why?
- Bohrbugs tend to be deterministic errors -- outright mistakes in the code. Once you understand what triggers them, they are easy to search for and fix.
- Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, the effect may appear long after the bug first "occurs." They are hard to fix because at the time the mistake happened, nothing obvious went wrong.
(A short illustrative sketch follows below.)
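To make the Bohrbug/Heisenbug distinction concrete, here is a small illustrative Python sketch (added here, not from the original slides) of a classic Heisenbug: a race condition between threads that share a counter. The lost updates depend on thread timing, and adding print statements or a debugger changes that timing, so the bug tends to vanish exactly when your people look for it.

```python
import threading

counter = 0          # shared state, updated without any lock

def worker(increments):
    global counter
    for _ in range(increments):
        # Deliberately non-atomic read-modify-write: two threads can read
        # the same value and each store back value + 1, losing an update.
        value = counter
        value += 1
        counter = value

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# We "expect" 400000, but depending on timing the total usually comes up
# short.  Instrument the loop (prints, breakpoints) and the timing changes,
# so the bug seems to disappear -- the signature of a Heisenbug.
print("counter =", counter, "(expected 400000)")
```

A Bohrbug, by contrast, would be something like an off-by-one in a loop bound: run the same input twice and it fails the same way both times, so it can be tracked down and fixed.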
Why Systems Fail
- Mostly, because something crashes
- Usually, it is software or a human error
- Mean time to failure improves with age, but software problems remain prevalent
- Every kind of software system is prone to failures
- Failure to plan for failures is the most common way for e-systems to fail

E-Reliability
We want e-commerce solutions to be reliable… but what should this mean?
- Fault-tolerant? Secure? Fast enough? Accessible to customers?
- In short: deliver critical services when needed, where needed, in a correct, timely manner

Costs of a Failure

Minimizing Downtime
- The idea is to design the critical parts of your system to survive failures
- Two basic approaches:
  - Recoverable systems are designed to restart without human intervention -- but may wait until the outage is repaired
  - Highly available systems are designed to keep running during the failure

Recoverability
- The technology is called "transactions"; we'll discuss this next time, but…
- The main issue is the time needed to restart the service
- For a large database, half an hour or more is not at all unusual
- Faster restart requires a "warm standby"
(A short preview sketch appears below, after the process-pair diagram.)

High Availability
- The idea is to have a way to keep the system running even while some parts are crashed
- For example, a backup that takes over if the primary fails
- The backup is kept "warm"; this involves replicating information
- As changes occur, the backup may lag behind

Complexity
- The looming threat to your e-commerce solution, no matter what it may be
- Even simple systems are hard to make reliable
- Complex systems are almost impossible to make reliable
- Yet innovative e-commerce projects often require fairly complex technologies!

Two Side-by-Side Case Studies
- American Advanced Automation System (AAS): intended as a replacement for the air traffic control system, needed because President Reagan fired many controllers in 1981; but the project was a fiasco and lost $6B
- French Phidias system: similar goals, slightly less ambitious; rolled out, on time and on budget, in 1999

Background
- Air traffic control systems are using 1970s technology
- Extremely costly to maintain and impossible to upgrade
- Meanwhile, the load on controllers is rising steadily
- The load can't easily be reduced

Air Traffic Control System (one site)
[Diagram: onboard radar feeds a team of controllers, who consult an X.500 directory and an air traffic database (flight plans, etc.).]

Politics
- The government wanted to upgrade the whole thing and solve a nagging problem
- Controllers demanded various simplifications and powerful new tools
- Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center

Technology
- IBM bid the project and proposed to use its own workstations
- These aren't super reliable, so IBM proposed to adopt a new approach to "fault tolerance"
- The idea is to plan for failure: detect failures when they occur and automatically switch to backups

Core Technical Issue?
- The problem revolves around high availability
- Waiting for a restart was not seen as an option: the goal was 10 seconds of downtime in 10 years
- So IBM proposed a replication scheme much like the "load balancing" approach
- IBM had the primary and backup simply do the same work, keeping them in the same state

Technology
[Diagram: conceptual flow of the system -- radar data arrives, the system finds tracks, identifies the flight, looks up its record, plans actions, and a human takes action.]

IBM's Fault-Tolerant Process-Pair Concept
[Diagram: the same flow (radar, find tracks, identify flight, look up record, plan actions, human action) runs twice in parallel, once in the primary and once in the backup, so both stay in the same state. A rough sketch follows below.]
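A rough sketch of the process-pair idea in the diagram above, in Python. This is an illustration, not IBM's actual design: the Replica class, the flight identifiers, and the failure check are all hypothetical, and real failure detection (heartbeats, timeouts) is elided.

```python
class Replica:
    """One member of a process pair: keeps its own copy of the track picture."""
    def __init__(self, name):
        self.name = name
        self.tracks = {}                      # flight id -> last known position

    def apply(self, flight_id, position):
        # Both replicas apply the same updates in the same order, so the
        # primary and the backup stay in (essentially) the same state.
        self.tracks[flight_id] = position

primary = Replica("primary")
backup = Replica("backup")

def handle_radar_update(flight_id, position):
    # Every input is delivered to BOTH members of the pair.
    for replica in (primary, backup):
        replica.apply(flight_id, position)

def primary_alive():
    # Placeholder: a real system would detect failure with heartbeats/timeouts.
    return True

def current_picture():
    # If the primary has crashed, the "warm" backup answers immediately
    # instead of making the controller wait for a lengthy restart.
    return primary.tracks if primary_alive() else backup.tracks

handle_radar_update("UA123", (42.45, -76.48))
print(current_picture())
```

Keeping the backup warm is what the 10-seconds-in-10-years goal demands: fail-over becomes a quick switch rather than a restart.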
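Looking back at the Recoverability slide: as a preview of the "transactions" idea to be covered next time, here is a minimal Python/SQLite sketch. The database file, table, and column names are hypothetical. The key property is all-or-nothing: either every update in the group commits, or a failure rolls them all back, so the system can restart into a consistent state without human intervention.

```python
import sqlite3

db = sqlite3.connect("shop.db")   # hypothetical e-commerce database
db.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.commit()

def transfer(payer, payee, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with db:   # opens a transaction; commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, payer))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, payee))
    except sqlite3.Error as err:
        # Nothing was committed, so even after a crash and restart the
        # database is still in a consistent state.
        print("transfer aborted, state unchanged:", err)
```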
Why Is This Hard?
- The system has many "real-time" constraints on it
- Actions need to occur promptly
- Even if something fails, we want the human controller to continue to see updates

IBM's Technology
- Based on a research paper by Flaviu Cristian
- But it had never been used except for proof-of-concept purposes, on a small scale in the laboratory

Politics
- IBM's proposal sounded good…
- … and they were the second-lowest bidder
- … and they had the most aggressive schedule
- So the FAA selected them over the alternatives
- IBM took on the whole thing all at once

Disaster Strikes
- Immediate confusion: all parts of the system seemed interdependent ("to design part A, I need to know how part B, also being designed, will work")
- Controllers didn't like the early proposals and insisted on major changes to the design
- The fault-tolerance idea was one of the reasons IBM was picked, but it made the system so complex that it went on the back burner

Summary of Simplifications
- Focus on some core components
- Postpone worrying about fault tolerance until later
- Try to build a simple version that can be fleshed out later
- … but the simplification wasn't enough; too many players kept intruding with requirements

Crash and Burn
- The technical guys saw it coming, probably as early as one year into the effort
- But they kept it secret (the "bad-news diode")
- Anyhow, management wasn't listening ("they've heard it all before -- whining engineers!")
- The fault-tolerance scheme didn't work, and many technical issues were unresolved
- The FAA kept out of the technical issues
- But a mixture of changing specifications and serious technical issues was at the root of the problems

What Came Out?
- In the USA, nothing. The entire system was useless: the technology was of an all-or-nothing style, and nothing was ready to deploy.
- The British later rolled out a very limited version of a similar technology, late and with many bugs, but it does work…

Contrast with the French
- They took a very incremental approach
- The early design sought to cut back as much as possible: if it isn't "mandatory," don't do it yet
- The focus was on the console cluster architecture and fault tolerance
- They insisted on using off-the-shelf technology

Contrast with the French (continued)
- Managers intervened in technology choices
- For example, the vendor wanted to build a home-brew fault-tolerance technology
- The French insisted on a specific existing technology and refused to bid out the work until vendors accepted
- A critical "good call," as it worked out

Learning by Doing
- To gain experience with the technology, they tested, and tested, and tested
- They designed simple prototypes and played with them
- They discovered that a large cluster would perform poorly, but found a "sweet spot" and worked within it
- This forced the project to cut back on some goals

Testing
- Nine-tenths of the time and expense on any system goes into testing, debugging, and integration
- Many projects overlook this
- The French planned conservatively

Software Bugs
- Figure roughly 1 bug per 10 lines in new code, and as many as 1 per 250 lines even in old code
- Bugs show up under stress; the trick is to run a system in an unstressed mode
- The French identified "stress points" and designed to steer far from them
- Their design also assumed that components would fail, and automated the restart (see the sketch below)
- All of this worked!
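The last two points, that components will fail and that the restart should be automated, can be sketched as a simple supervisor loop. This is an illustration in Python, not the Phidias design; the run_console.py worker and the restart policy are hypothetical.

```python
import subprocess
import time

def supervise(command, max_restarts=5):
    """Run a component and restart it automatically whenever it fails."""
    restarts = 0
    while True:
        worker = subprocess.Popen(command)   # start (or restart) the component
        exit_code = worker.wait()            # block until it exits or crashes
        if exit_code == 0:
            break                            # clean shutdown: nothing to do
        restarts += 1
        print(f"component failed (exit {exit_code}); restart #{restarts}")
        if restarts >= max_restarts:
            print("too many failures, escalating to a human")
            break
        time.sleep(1)                        # brief pause to avoid a restart storm

# Hypothetical console process for one controller position.
supervise(["python", "run_console.py"])
```

Treating a crash as a routine, planned-for event is what the slide means by designing for failure instead of being surprised by it.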
Take-Aways from the French Project?
- There were complex technical issues at the core of the system
- But they managed to break a big project into pieces
- Do the critical core first, separately, and focus exclusively on it
- Test, test, test
- Don't build anything you can possibly buy
- Management was technically sophisticated enough to make some critical "calls"

Your Problem
- E-commerce systems are at e-risk
- These e-risks take many forms: system complexity, failure to plan for failures, poor project management
- We ignore this at our peril, as we've seen
- But how can we learn to do better?

Keys to Reliability
- Know the basic technologies
- Realize that software is buggy and failures will happen; design to treat failure as a mundane event -- failure to plan for failure is the biggest e-risk!
- Complexity is a huge threat; use your naiveté as an advantage: if you can't understand it, why assume that "they" can understand it?

E-Commerce Technologies
- The network and its associated services
- Databases
- Web servers
- "Scripts" -- the glue your people use to tie it all together

Next Lecture
- Look at some realistic e-commerce systems
- Ask ourselves where to start first, if we need to convince ourselves that the system will be reliable enough
- The trick is to balance system complexity against adequate risk coverage