Software Failures

Financial

Chemical Bank's ATMs
• 100,000 customers debited twice
• fault in software; the underlying cause was a bank merger - the software was changed to cope with the merged ATM systems

VISA UK
• data centre programmer error caused hundreds of valid cards to be rejected for several hours

Intuit's MacInTax Leaks Financial Secrets: In 1995, Intuit's tax software for Windows and Macintosh suffered a series of bugs, including several that prompted the company to pledge to pay any resulting penalties and interest. The scariest bug was discovered in March 1995: code included in a MacInTax debug file allowed Unix users to log in to Intuit's master computer, where all MacInTax returns were stored. From there, a user could modify or delete returns. Intuit went on to win BugNet's annual bug-fix award in 1996 by responding to bugs faster than any other major vendor.

Financial markets
• considerable risk of financial loss from software failure
• fast development a necessity
• TAURUS (the London Stock Exchange's abandoned electronic settlement project)

Banks and banking are highly automated. High street banks are completely reliant on computer systems for all loan, account and transaction processing. Global financial markets are likewise dependent. The flow of capital around the world is completely automated - paper share certificates, bonds and cheques often do not change hands until days or weeks after the transactions have taken place. Confirmation of deals is almost entirely fax or telephone based. The City of London's CREST system will allow complete electronic trading in stocks and shares. Financial projects of this kind have high dependability requirements.

As well as large computer-based systems, banks are also reliant on small, often single-programmer, projects. The derivatives markets continually launch new exotic financial products whose profit opportunities will only be available over a very short time span (usually days). Programmers will often "hack up" an analysis of the product's yield in a matter of hours (using spreadsheets etc.) and millions of dollars may "move" on the resulting recommendation. In some investment banks the traders might even write their own code. Needless to say the risk is high, but then so are the opportunities.

Transport

Mariner 1 Venus Probe Loses its Way: In 1962, Mariner 1, launched from Cape Canaveral, was set to go to Venus. After takeoff, the unmanned rocket carrying the probe went off course and NASA had to blow up the rocket to avoid endangering lives on earth. NASA later attributed the error to a faulty line of Fortran code. The report stated, "Somehow a hyphen had been dropped from the guidance program loaded aboard the computer, allowing the flawed signals to command the rocket to veer left and nose down… Suffice it to say, the first U.S. attempt at interplanetary flight failed for want of a hyphen." The vehicle cost more than $80 million, prompting Arthur C. Clarke to refer to the mission as "the most expensive hyphen in history."

New Denver Airport Misses its Opening: In 1995, the Denver International Airport was intended to be a state-of-the-art airport, with a complex, computerized baggage-handling system and 5,300 miles of fiber-optic cabling. Unfortunately, bugs in the baggage system caused suitcases to be chewed up and drove automated baggage carts into walls. The airport eventually opened 16 months late, $3.2 billion over budget, and with a mainly manual baggage system.

NASA's Space Shuttle
• First actual launch 10 April 1984: 3 years late, millions of dollars overspent
• First planned launch cancelled because of a computer synchronisation fault
• In 1989 the "Program Notes and Waivers" book detailed known software faults

The NASA Space Shuttle is one of the largest documented software systems in the world - 4M LOC in total, including the computers on the ground (150,000 LOC inside the shuttle in 5 computers). The total system R&D costs were in excess of $10 billion. Most of the software is developed by IBM.

First actual launch: 10 April 1984; 3 years late, millions of $ overspend. The first planned launch was cancelled because of a computer synchronisation fault. The fault was traced to a change made 2 years earlier:
• delay factor reset from 50 to 80 milliseconds
• the change meant a 1/67 chance of launch failure
• the failure was not noticed during thousands of hours of software testing

In 1989 (5 years and 20 flights on) "Program Notes and Waivers", the book of known software problems, was supplied to astronauts. It described a number of faults:
• interleaving of two messages on the shuttle display screen
• use of a common buffer area shared by the on-board keyboard and ground communications: a program or data being uploaded at the same time as an astronaut is typing causes the contents of the buffer to be jumbled (a minimal sketch of this failure mode appears at the end of this Transport section)

Since 1989 a massive effort has been made to use the most sophisticated software engineering techniques. This has resulted in significant claimed improvements to software reliability (backed up by impressive metrics).

C-17 Cargo Plane (McDonnell Douglas)
• $500 million over budget due to problems with its avionics software
• the C-17 included 19 onboard computers, 80 microprocessors and 6 different programming languages

Fly-by-wire aircraft (Airbus)
• First civilian fly-by-wire aircraft
• Computer controls: the Electrical Flight Control System (EFCS) qualifies the A320 as a fly-by-wire aircraft
• Accidents: Habsheim airshow, Bangalore, Warsaw, Strasbourg
• Poorer safety record than conventional aircraft

The Airbus A320 was the first civilian aeroplane to use a fully computerised flight control system. As the first "fly-by-wire" civilian airliner its safety record has been of great interest and has attracted much comment. Since its launch the A320 has suffered four fatal accidents which may be, at least partially, explained by software failures. A number of potential causes of failure have been identified: the requirements for the on-board computers were inadequate where the aircraft was pushed to the edge of its flight envelope; and the on-board systems are more complex than conventional equipment, which affected the way pilots flew the plane and increased the likelihood of pilot error. However, officially, software was not to blame. The table below shows the number of hull losses per million departures for a range of civilian aircraft (see [Mellor 94] for details).

Aircraft model                          Hull losses per million departures (to Dec 93)
Boeing 757                              0.00
Boeing 747-400                          1.86
Boeing 767 (same generation as A320)    0.29
Airbus A300                             0.98
Boeing 737                              0.53
Airbus A320                             2.50
DC-10                                   2.67

In September 1994 an A340 inbound for Heathrow experienced a false fuel alarm, immediately followed by two problems in the flight management system and the instrument landing system. The AAIB recommended a significant improvement in the hardware and software reliability of the flight management and fuel sub-systems [AAIB 1995].
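To make the shared-buffer item in the Shuttle's "Program Notes and Waivers" list above concrete, here is a minimal sketch of that failure mode. It is illustrative Python only - the names, message contents and threading model are invented and have nothing to do with the real flight software - but it shows how two unsynchronised writers jumble a common buffer.

```python
import threading, time

buffer = []   # shared buffer with no locking (this is the fault)

def write(source, words):
    # each writer emits its message one piece at a time
    for w in words:
        buffer.append(f"{source}:{w}")
        time.sleep(0.01)   # other work happens between pieces

keyboard = threading.Thread(target=write, args=("KEYBOARD", ["alpha", "bravo", "charlie"]))
uplink   = threading.Thread(target=write, args=("UPLINK",   ["one", "two", "three"]))
keyboard.start(); uplink.start()
keyboard.join(); uplink.join()

print(" | ".join(buffer))
# Typical jumbled result:
#   KEYBOARD:alpha | UPLINK:one | KEYBOARD:bravo | UPLINK:two | ...
```

Holding a lock for the duration of each complete message, or giving each writer its own buffer, keeps the messages intact.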
Military

Star Wars missile crash - Cape Canaveral
• Aries rocket blown up shortly after launch
• Instead of heading northeast over the Atlantic it sped south
• Technician accidentally loaded the wrong software
• Cost of the launch was $5 million

Launch controllers loaded the wrong computer program into the guidance unit of a "star wars" (SDI) rocket, which had to be destroyed when it veered sharply off course. The rocket, carrying "star wars" experiments, was blown up 29 seconds after lift-off. Instead of heading northeast over the Atlantic it sped south. No injuries or property damage on the ground were reported. A technician had accidentally hit the wrong key while loading software into the rocket's guidance system. As a result, ground-test rather than flight software was loaded, causing the steering nozzles to lock in place. No one checked to make sure the right software was loaded.

Patriot Missile Misses: In 1991, a U.S. Patriot missile battery was used to head off Iraqi Scuds during the Gulf War. But the system failed to track several incoming Scud missiles, including one that killed 28 U.S. soldiers in a barracks in Dhahran, Saudi Arabia. The problem stemmed from a software error that put the tracking system off by 0.34 of a second. As Ivars Peterson states in Fatal Defect, the system was originally supposed to be operated for only 14 hours at a time. In the Dhahran attack, the missile battery had been on for 100 hours, so the errors in the system's clock accumulated to the point that the tracking system no longer functioned. The military had in fact already found the problem but had not sent the fix in time to prevent the barracks explosion. The arithmetic behind the accumulating clock error is sketched below.
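The figures in the following sketch come from widely cited analyses of the incident rather than from the text above: the clock counted time in tenths of a second, and the conversion to seconds used a 24-bit fixed-point approximation of 1/10. Because 0.1 has no exact binary representation, each tick contributed a tiny error (the commonly quoted value is about 9.5e-8 s) that only matters once uptime grows long. A minimal, illustrative Python sketch - not the deployed Patriot code:

```python
# Illustrative only -- not the deployed Patriot code.
# Widely cited analyses give the error of the 24-bit fixed-point
# approximation of 0.1 s as roughly 9.5e-8 s per clock tick.
PER_TICK_ERROR = 9.5e-8      # seconds lost per 0.1 s tick (reported figure)
TICKS_PER_HOUR = 10 * 3600   # the clock ticks every 0.1 s

def clock_drift(hours_up: float) -> float:
    """Accumulated clock error after the battery has been up for `hours_up` hours."""
    return hours_up * TICKS_PER_HOUR * PER_TICK_ERROR

for hours in (14, 100):
    print(f"after {hours:3d} h uptime: tracking time off by ~{clock_drift(hours):.2f} s")

# after  14 h uptime: ~0.05 s  (inside the intended 14-hour operating window)
# after 100 h uptime: ~0.34 s  (a Scud covers several hundred metres in that
#                               time, so the range gate misses the target)
```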
Medical

Radiation Machine Kills Four, 1985 to 1987: Therac-25, a radiation-treatment machine made by Atomic Energy of Canada Limited (AECL), resulted in several cancer patients receiving lethal overdoses of radiation. Four patients died. When their families sued, all the cases were settled out of court. A later investigation by independent scientists Nancy Leveson and Clark Turner found that accidents occurred even after AECL thought it had fixed particular bugs. "A lesson to be learned from the Therac-25 story is that focusing on particular software bugs is not the way to make a safe system," they wrote in their report. "The basic mistakes here involved poor software-engineering practices and building a machine that relies on the software for safe operation."

• Malfunction killed at least two patients; six received severe overdoses
• Software designers did not anticipate use of the keyboard's arrow keys
• Possible reasons for failure:
  - the safety analysis omitted the possibility of software faults
  - over-confidence in the software led to the removal of hardware protection
  - programming was done to commercial rather than safety-critical standards

Radiation therapy aims to destroy cancer by delivering a carefully calculated dose of radiation to the tumour, while minimising the irradiation of the surrounding tissue. Treatment is applied via a computer-controlled radiotherapy machine consisting of an X-ray and electron beam, a control desk and a rotating table. The Therac-25 machine had this basic set-up except that the safety interlocks were implemented in software rather than hardware.

Between June 1985 and January 1987 six patients received severe dosages of radiation while being treated on the Therac-25 medical linear accelerator. Two died shortly after treatment, and two more might have died from the accident had they not died earlier from the cancer for which they were being treated. The other two suffered various degrees of scarring and permanent disability. In each of the accidents the electron beam was applied at full strength. Some of the patients reported immediate symptoms. At the same time the messages MALFUNCTION 54 and TREATMENT PAUSE appeared on the control screen. Unfortunately malfunctions were so common that operators chose to ignore them and press the "proceed" key. After two such incidents at the same hospital, two staff experimented to reproduce the failure mode. The trigger for the failure was found to be: the operator had incorrectly entered "x" for X-ray instead of "e" for electron, pressed the "up-arrow" key, corrected the erroneous entry, then moved the cursor to the bottom of the screen and selected "beam on" - all within 8 seconds. This combination of events caused the power of the electron beam to be left at a value appropriate for X-ray treatment, which was 100 times the power required for electron beam treatment. See [Leveson and Turner 1993] for details. The sketch below illustrates this class of bug.
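The trigger described above is a race between a slow beam-setup task and operator edits. The following is a greatly simplified, hypothetical Python sketch of that class of bug - the class names, power values and timings are invented and this is not the actual Therac-25 code - showing how a setup routine that latches the mode once and never re-checks it can fire an electron beam at X-ray power.

```python
import threading, time

# Illustrative power levels only: X-ray mode needs roughly 100x the electron power.
POWER = {"x": 25000, "e": 250}

class Console:
    def __init__(self):
        self.mode = None            # what the operator currently sees on screen
        self.latched_power = None   # what the beam will actually fire at

    def operator_types(self, mode):
        self.mode = mode

    def setup_beam(self):
        mode_at_start = self.mode   # BUG: mode is latched once and never re-checked
        time.sleep(0.5)             # stands in for the several-second setup time
        self.latched_power = POWER[mode_at_start]

console = Console()
console.operator_types("x")                          # mistaken X-ray entry
setup = threading.Thread(target=console.setup_beam)  # setup starts in the background
setup.start()
time.sleep(0.1)                                      # operator notices the mistake...
console.operator_types("e")                          # ...and corrects it within the window
setup.join()

print(f"screen shows mode {console.mode!r}, beam set to {console.latched_power}")
# -> screen shows mode 'e', beam set to 25000: an electron treatment at X-ray power
```

Re-validating the mode and dose immediately before the beam is enabled, or an independent hardware interlock, would catch the inconsistency.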
London Ambulance Service Failure
• Novel computer-aided dispatch system collapsed
• System was not accurately tracking the position and status of each ambulance
• Led to a downward spiral of delays
• Ambulance crews accustomed to arriving in minutes now took hours
• Multiple causes:
  - the software assumed perfect ambulance position information
  - a recent change introduced a memory leak
  - operators were "out of the loop"

The London Ambulance Service (LAS) carries 5000 patients and responds to 2000 to 2500 calls per day, of which 1300 to 1600 are emergencies. At the heart of the operation is the ambulance dispatch system, which is responsible for receiving calls for assistance, identifying the position of the nearest ambulance and crew, dispatching the vehicle that can most rapidly attend to an outstanding call, and monitoring each vehicle's status. LAS introduced a new computerised system that had not undergone a full trial under a realistic workload. At the same time the old system was scrapped. On the first day of service things were quiet until 10 o'clock, when problems became apparent. As time went on the number of calls increased and the system was not keeping accurate track of the position and status of each ambulance. Using an increasingly incorrect database, the system was dispatching vehicles that were not the closest to the scene, or making multiple assignments to cover a single call. This led to a large number of exception messages. As the queue of messages grew, the system slowed down. Delays in responding to calls built up, and members of the public placing follow-up calls added further to the problems. The vicious downward spiral of delays generating messages, causing further delay, leading to more messages, and so on, continued until the system collapsed. Ambulance crews accustomed to arriving within minutes were now taking hours. The causes of the disaster were attributed to 1) the system "freezing" because it could not cope with the workload, 2) the system's reliance on perfect information being available, and 3) operators being left "out of the loop". Some months later the system crashed because of a memory leak introduced during maintenance. After this failure the LAS went over to a manual system. See [Mellor 94] for a brief description and [LAS 94] for details of the inquiry conclusions.

Communication/security

AT&T Long Distance Service Fails: In 1990, switching errors in AT&T's call-handling computers caused the company's long distance network to go down for nine hours. It was the worst of several telephone outages in the history of the system. The meltdown affected thousands of services and was eventually traced to one faulty line of C code in several hundred thousand.

Java Opens Security Holes, Browsers Simply Crashed: This is not a single bug but a veritable bug collection. The sheer quantity of press coverage about bugs in Sun's Java and the two major browsers had a profound effect on how the average consumer perceives the Internet. The conglomeration of headlines probably set back the e-commerce industry by five years. Java's problems surfaced in 1996, when researchers at the University of Washington and Princeton began to uncover a series of security holes in Java that could, theoretically, allow hackers to download personal information from someone's home PC. To date, no one has reported a real case of a hacker exploiting the flaw, but knowing that the possibility existed prompted several companies to instruct employees to disable Java in their browsers.

Internet Worm: A self-propagating program brought down 10% of all Internet nodes.

Browser Wars: Competition inspired Netscape and Microsoft to accelerate the schedules for their 4.0 browser releases and resulted in a swarm of bugs, ranging from JavaScript flaws in Netscape's Communicator to a reboot bug in Microsoft's Internet Explorer. Communicator was on version 4.04 for Windows 95 and NT six months after its first release. Internet Explorer 4.01, the first of many bug-fix versions, arrived in December, two months after the initial release of IE 4.0.

ICL poll tax
• Company successfully sued for supply of faulty software
• ICL's limited liability defence rejected by the judge

ICL was sued by St Albans City Council for supplying flawed software. The judge ruled that the standard liability clause contained in the firm's contract did not apply under the Unfair Contract Terms Act 1977. The poll tax system, supplied by ICL, overestimated the number of eligible poll tax payers by 3,000. As a result the Council received less government funding than it was entitled to.

Commercial

Pepsi Cola
• Fault led to printing of 500,000 winning numbers rather than one
• Company faced a substantial liability

Pepsi Cola plants in the Philippines were besieged by angry customers after a software fault led to a printing failure - 500,000 bottle tops were printed with the number 349, the winning number in a Pepsi drinks promotion. Pepsi faced a liability of £11.5 billion.

Airtours
• Airlines and holiday companies are completely dependent on automatic booking systems
• Failure in the booking system led to a loss of £5m

Leading holiday firm Airtours lost millions of pounds in sales after its booking system crashed. The failure came at the worst possible time, coinciding with the company's launch of its 1994 holiday brochure. One estimate put the cost of lost sales at £5m. The failure was attributed to an unfortunate combination of operator and software failures.

Environmental

Deregulation of California Utilities Postponed: In 1998, two new electrical power agencies charged with deregulating the California power industry postponed their plans by at least three months. The delay let them debug the software that runs the new power grid.
Consumers and businesses were supposed to be able to choose from some 200 power suppliers as of January 1, 1998, but time ran out for properly testing the communications system that links the two new agencies with the power companies. The project was postponed after a seven-day simulation of the new system revealed serious problems. The delay cost as much as $90 million - much of which was eventually footed by ratepayers and which may have caused some of the new power suppliers to go into debt or out of business before they started.

Concerns about Sizewell B Nuclear Power Station
• Primary Protection System (PPS) is implemented in software: 100,000 lines of code in the PPS
• Functionality is more sophisticated than the conventional hardware approach:
  - provides diagnostic aids to reduce downtime
  - configurable for different demand cycles
  - automatically calibrated and tested
• Concerns over complexity and size
• Original requirements were imprecise
• Tested using dynamic and static techniques - not enough in themselves to give confidence

The Sizewell B Nuclear Power Station's emergency shutdown system is triggered by a primary protection system (PPS) implemented in around 100,000 lines of software. Software was used to bring additional operational advantages over hardware, i.e. lower downtime and fewer false trips. The assurance of the system raised many technical questions and prompted some concern about the PPS's reliability. The important issues were:
• The software is too large and complex for any plausible claim that it is fault-free
• The basic protection functionality is not isolated from other, less critical functionality (the safety kernel approach)
• The PPS is backed up by the Secondary Protection System (SPS), but there are possible demands which the SPS is not designed to handle
• The design and development do not seem to have used industry best practice
• The rigour and representativeness of the testing activities are questionable
• The complexity would make future changes difficult and potentially dangerous

In addition to these concerns there were worries about the safety culture surrounding the project (Channel 4 Dispatches).

Year 1900 bug
In 1992 Mary from Winona, Minnesota, received an invitation to attend a kindergarten. Mary was 104 at the time.
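The usual explanation for this anecdote is the classic two-digit-year mistake: with only the last two digits of the birth year stored, a program that assumes every year is 19xx makes someone born in 1888 four years old. A minimal, hypothetical Python sketch (the function name and details are invented for illustration):

```python
# Sketch of the two-digit-year mistake behind the anecdote above (illustrative only).
# Mary, born in 1888, is stored as year "88"; a program that assumes all
# two-digit years are 19xx computes her age as 4 and mails a kindergarten invitation.
def age_in(current_year: int, stored_two_digit_year: int) -> int:
    assumed_birth_year = 1900 + stored_two_digit_year   # the buggy assumption
    return current_year - assumed_birth_year

print(age_in(1992, 88))   # -> 4, although Mary (born 1888) is actually 104
```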