software failures - School of Electronic Engineering and

Software Failures
Financial
Chemical Bank’s ATMs



100 000 customers debited twice
fault in software
underlying cause was bank merger - software changed to cope with merged ATM
systems
VISA UK data centre
programmer error caused hundreds of valid cards to be rejected for several hours
Intuit's MacInTax Leaks Financial Secrets:
In 1995, Intuit's tax software for Windows and Macintosh suffered a series of bugs, including several that
prompted the company to pledge to pay any resulting penalties and interest. The scariest bug was
discovered in March 1995: the code included in a MacInTax debug file allowed Unix users to log in to
Intuit's master computer, where all MacInTax returns were stored. From there, the user could modify or
delete returns. Intuit later ended up winning BugNet's annual bug-fix award in 1996 by responding to bugs
faster than any other major vendor.
Financial markets


considerable risk of financial loss from software failure
fast development a necessity
TAURUS
Banks and banking are highly automated. High street banks are completely reliant on computer
systems for all loan, account and transaction processing. Global financial markets are likewise
dependent. The flow of capital around the world is completely automated - paper share
certificates, bonds and cheques do not often change hands until days or weeks after the
transactions have taken place. Confirmation of deals is almost entirely fax or telephone based.
The City of London’s Crest system will allow complete electronic trading in stocks and shares.
Financial projects of this kind have high dependability requirements.
As well as large computer-based systems, banks are also reliant on small, often single-programmer,
projects. The derivatives markets continually launch new exotic financial products
whose profit opportunities will only be available over a very short time span (usually days).
Programmers will often “hack up” an analysis of the product’s yield in a matter of hours (using
spreadsheets etc.) and millions of dollars may “move” on the resulting recommendation. In some
investment banks the traders might even write their own code. Needless to say the risk is high,
but then so are the opportunities.
Transport
Mariner 1 Venus Probe Loses its Way:
In 1962, Mariner 1 was launched from Cape Canaveral, bound for Venus. After takeoff, the unmanned
rocket carrying the probe went off course and NASA had to blow up the rocket to avoid endangering
lives on earth. NASA later attributed the error to a faulty line of Fortran code. The report stated,
"Somehow a hyphen had been dropped from the guidance program loaded aboard the computer, allowing
the flawed signals to command the rocket to veer left and nose down… Suffice it to say, the first U.S.
attempt at interplanetary flight failed for want of a hyphen." The vehicle cost more than $80 million,
prompting Arthur C. Clarke to refer to the mission as "the most expensive hyphen in history."
New Denver Airport Misses its Opening:
In 1995, the Denver International Airport was intended to be a state-of-the-art airport, with a complex,
computerized baggage-handling system and 5,300 miles of fiber-optic cabling. Unfortunately, bugs in the
baggage system caused suitcases to be chewed up and drove automated baggage carts into walls. The
airport eventually opened 16 months late, $3.2 billion over budget, and with a mainly manual baggage
system.
NASA’s Space Shuttle
• First actual launch 10 April 1984 - 3 years late, millions of $ overspend.
• First planned launch cancelled because of computer synchronisation fault
• In 1989 “Program Notes and Waivers” book detailed software faults
The NASA Space Shuttle is one of the largest documented software systems in the world - total
4M LOC including computers on ground (150,000 LOC inside shuttle in 5 computers). The total
system R&D costs were in excess of $10 billion. Most of the software is developed by IBM.
First actual launch: 10 April 1984; 3 years late, millions of $ overspend. First planned launch
cancelled - computer synchronisation fault. Fault traced to change made 2 years earlier:
• delay factor reset from 50 to 80 millisecs
• change meant 1/67 chance of launch failure
• failure not noticed during thousands of hours of software testing
In 1989 (5 years, 20 flights on) “Program Notes and Waivers”, the book of known software
problems, was supplied to astronauts. It described a number of faults:
• interleaving of two messages on shuttle display screen
• use of common buffer area shared by on-board keyboard and ground communications:
a program or data upload arriving while an astronaut is typing causes the contents of the
buffer to be jumbled
Since 1989 a massive effort has been made to use the most sophisticated software engineering
techniques. This has resulted in significant claimed improvements to the software reliability
(backed up by impressive metrics).
C-17 cargo Plane (McDonnell Douglas)
$500 million over budget due to problems with its avionics software. C-17 included 19 onboard computers,
80 microprocessors, 6 different programming languages
Fly-by-wire aircraft (Airbus)
• First civilian fly-by-wire aircraft
• Computer controls: Electrical Flight Control System (EFCS) qualifies A320 as a fly-by-wire aircraft
• Accidents:
    Habsheim airshow
    Bangalore
    Warsaw
    Strasbourg
• Poorer safety record than conventional aircraft
The Airbus A320 was the first civilian aeroplane to use a fully computerised flight control system.
As the first “fly-by-wire” civilian airliner its safety record has been of great interest and has
attracted much comment.
Since its launch the A320 has suffered four fatal accidents which may be, at least partially,
explained by software failures. A number of potential causes for failure have been identified: the
requirements for on-board computers were inadequate where aircraft have been pushed to the edge
of the flight envelope; the on-board systems are more complex than conventional equipment - this
affected the way pilots flew the plane (increased the likelihood of pilot error). However, officially,
software was not to blame.
The table below shows the number of hull losses per million departures over a range of civilian
aircraft (See [Mellor 94] for details).
Aircraft model                          Hull losses per million departures (to Dec 93)
Boeing 757                              0.00
Boeing 747-400                          1.86
Boeing 767 (same generation as A320)    0.29
Airbus A300                             0.98
Boeing 737                              0.53
Airbus A320                             2.50
DC-10                                   2.67
In September 1994 an A340 inbound for Heathrow experienced a false alarm about fuel
immediately followed by two problems in the flight management system and the instrument
landing system. The AAIB recommended a significant improvement in hardware and software
reliability of the flight management and fuel sub-systems [AAIB 1995].
Military
• Star Wars missile crash - Cape Canaveral
Aries rocket blown up 23 seconds after it was launched
Instead of heading northeast over the Atlantic it sped south
Technician accidentally loaded wrong software
Cost of launch was $5 million
Launch controllers loaded the wrong computer program into the guidance unit of a “star wars”
(SDI) rocket, which had to be destroyed when it veered sharply off course. The rocket, which was
carrying “star wars” experiments, was blown up 29 seconds after lift-off. Instead of heading
northeast over the Atlantic it sped south. No injuries or property damage on the ground were
reported.
A technician accidentally hit the wrong key while loading software into the rocket’s guidance
system. As a result, ground test software rather than flight software was loaded, causing the
steering nozzles to lock in place. No one checked to make sure the right software
was loaded.
Patriot Missile Misses:
In 1991, U.S. Patriot missile batteries were used to intercept Iraqi Scuds during the Gulf War. But the
system failed to track several incoming Scud missiles, including one that killed 28 U.S. soldiers in a
barracks in Dhahran, Saudi Arabia. The problem stemmed from a software error that put the tracking
system off by 0.34 of a second. As Ivars Peterson states in Fatal Defect, the system was originally
supposed to be operated for only 14 hours at a time. In the Dhahran attack, the missile battery had been on
for 100 hours. This meant that the errors in the system's clock accumulated to the point that the tracking
system no longer functioned. The military had in fact already found the problem but hadn't sent the fix in
time to prevent the barracks explosion.
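The way a tiny clock error grows with uptime can be sketched with a short calculation. The sketch below is not the Patriot code. It assumes, following commonly cited analyses of the incident rather than anything stated above, that time was counted in tenths of a second and that the value 0.1 was held in fixed point with 23 bits after the binary point, chopped rather than rounded. The function names are illustrative only.

```python
from fractions import Fraction

# Assumption (not from the text above): 0.1 s is stored with 23 fractional
# bits and the remaining bits are chopped off.
FRACTION_BITS = 23
TICKS_PER_SECOND = 10  # the clock counts tenths of a second


def stored_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return the value actually stored when 0.1 is chopped to `bits` bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_running: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error, in seconds, after `hours_running` hours."""
    error_per_tick = Fraction(1, 10) - stored_tenth(bits)
    ticks = hours_running * 3600 * TICKS_PER_SECOND
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    for hours in (14, 100):
        print(f"{hours:>3} h running: drift = {clock_drift_seconds(hours):.4f} s")
```

Under these assumptions the drift after 100 hours comes out at roughly a third of a second, in line with the 0.34 second figure quoted above, while the drift after 14 hours is about seven times smaller.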
Medical
Radiation Machine Kills Four 1985 to 1987:
Therac-25, a radiation-treatment machine made by Atomic Energy of Canada Limited (AECL), resulted in
several cancer patients receiving lethal overdoses of radiation. Four patients died. When their families sued,
all the cases were settled out of court. A later investigation by independent scientists Nancy Leveson and
Clark Turner found that accidents occurred even after AECL thought it had fixed particular bugs. "A lesson
to be learned from the Therac-25 story is that focusing on particular software bugs is not the way to make a
safe system," they wrote in their report. "The basic mistakes here involved poor software-engineering
practices and building a machine that relies on the software for safe operation."
• Malfunction killed at least two patients; six received severe overdose
• Software designers did not anticipate use of keyboard’s arrow keys
• Possible reasons for failure
    safety analysis omitted possibility of software fault
    over-confidence in software led to removal of hardware protection
    programming done to commercial rather than safety-critical standards
Radiation therapy aims to destroy cancer by delivering a carefully calculated dose of radiation to
the tumor, while minimising the irradiation of the surrounding tissue. Treatment is applied via a
computer controlled radiotherapy machine consisting of an X-Ray and electron beam, a control
desk and a rotating table. The Therac-25 machine had this basic set-up except that the safety
interlocks were implemented in software rather than hardware.
Between June 1985 and January 1987 six patients received severe dosages of radiation while
being treated on the Therac-25 medical linear accelerator. Two died shortly after treatment, two
more might have died from the accident had they not died earlier from the cancer for which they
were being treated. The other two suffered various degrees of scarring and permanent disability.
In each of the accidents the electron beam was applied at full strength. Some of the patients
reported immediate symptoms. At the same time the messages MALFUNCTION 54 and
TREATMENT PAUSE appeared on the control screen. Unfortunately malfunctions were so
common that operators chose to ignore them and press the “proceed” key. After two such
incidents at the same hospital two staff experimented to reproduce the failure mode. The trigger
for the failure was found to be: the operator had incorrectly entered “x” for X-Ray instead of “e” for
electron, pressed the “up-arrow” key, corrected the erroneous entry, then moved the cursor to the
bottom of the screen and gave the “beam on” command - all within 8 seconds. This
combination of events caused the power of the electron beam to be left at a value appropriate for
X-Ray treatment, which was 100 times the power required for electron beam treatment.
See [Leveson and Turner 1993] for details.
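The failure pattern described above, an operator correction that arrives while setup is still in progress and is never applied, can be illustrated with a deliberately simplified toy model. This is not the Therac-25 software: the class, the method names and the single-timer model are all hypothetical, and only the 8 second window and the 100:1 power ratio are taken from the text. The sketch also shows the kind of independent consistency check before beam-on that a hardware or software interlock would provide.

```python
from dataclasses import dataclass
from typing import Optional

ELECTRON_POWER = 1
XRAY_POWER = 100 * ELECTRON_POWER  # the text gives a 100:1 ratio
SETUP_SECONDS = 8.0                # the window named in the text
POWER_FOR_MODE = {"e": ELECTRON_POWER, "x": XRAY_POWER}


@dataclass
class ToyMachine:
    """Hypothetical model: mode is what the screen shows, power is what is set."""
    mode: str = "e"
    power: int = ELECTRON_POWER
    setup_started_at: Optional[float] = None

    def enter_mode(self, mode: str, now: float) -> None:
        self.mode = mode  # the screen always reflects the latest entry
        busy = (self.setup_started_at is not None
                and now - self.setup_started_at < SETUP_SECONDS)
        if busy:
            return  # modelled flaw: an edit made during setup never reaches power
        self.power = POWER_FOR_MODE[mode]
        self.setup_started_at = now

    def beam_on_unchecked(self) -> int:
        """Fire with whatever power is set, trusting the setup logic."""
        return self.power

    def beam_on_checked(self) -> int:
        """Fire only if the power matches the mode shown to the operator."""
        if self.power != POWER_FOR_MODE[self.mode]:
            raise RuntimeError("power does not match selected mode; beam inhibited")
        return self.power


def run_scenario(correction_time: float) -> ToyMachine:
    """Operator enters 'x' at t=0, then corrects it to 'e' at correction_time."""
    machine = ToyMachine()
    machine.enter_mode("x", now=0.0)
    machine.enter_mode("e", now=correction_time)
    return machine
```

In this model a correction made 3 seconds after the wrong entry leaves the machine showing electron mode while still set to X-Ray power, and only the checked beam-on refuses to fire. A correction made after the 8 second window is applied normally.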
London Ambulance Service Failure
• Novel computer-aided dispatch system collapsed
• System wasn’t tracking accurately the position and status of each ambulance
• Led to downward spiral of delays
• Ambulance crews accustomed to arriving in minutes now took hours
• Multiple causes
    software assumed perfect ambulance position information
    recent change introduced memory leak
    operators were “out of the loop”
The London Ambulance Service (LAS) carries 5000 patients and responds to 2000 to 2500 calls
per day, of which 1300 to 1600 are emergencies. At the heart of the operation is the ambulance
dispatch system, which is responsible for receiving calls for assistance, identifying the position of
the nearest ambulance and crew, dispatching of the vehicle that can most rapidly attend to an
outstanding call, and monitoring the vehicle’s status.
LAS introduced a new computerised system that had not undergone a full trial under a realistic
workload. At the same time the old system was scrapped. On the first day of service things were
quiet until 10 o’clock, when problems started to become apparent. As time went on the number of
calls increased and the system was not keeping track accurately of the position and status of
each ambulance. Using an increasingly incorrect database the system was dispatching vehicles
that were not the closest to the scene, or making multiple assignments to cover a single call. This
led to a large number of exception messages. As the queue of messages grew, the system
slowed down. Delays in response to calls by ambulance crews built up, and members of the
public placing follow-up calls further added to problems. The vicious downward spiral of delays
generating messages, causing further delay, leading to more messages etc. continued until the
system collapsed. Ambulance crews accustomed to arriving within minutes were now taking
hours.
The causes of the disaster were attributed to 1) the system “freezing” because it could not cope
with the workload, 2) the system’s reliance on perfect information being available and 3) operators
being left “out of the loop”. Some months later the system crashed because of a memory leak
introduced during maintenance. After this failure the LAS went over to a manual system.
See [Mellor 94] for a brief description and [LAS 94] for details of the inquiry conclusions.
Communication/security
AT&T Long Distance Service Fails:
In 1990, switching errors in AT&T's call-handling computers caused the company's long distance
network to go down for nine hours. It was the worst of several telephone outages in the history of the
system. The meltdown affected thousands of services and was eventually traced to one faulty line of C code
in several hundred thousand.
Java Opens Security Holes; Browsers Simply Crashed
This is not a single bug but a veritable bug collection. The sheer quantity of press coverage about bugs in
Sun's Java and the two major browsers had a profound effect on how the average consumer perceives the
Internet. The conglomeration of headlines probably set back the e-commerce industry by five years.
Java's problems surfaced in 1996, when research at the University of Washington and Princeton began to
uncover a series of security holes in Java that could, theoretically, allow hackers to download personal
information from someone's home PC. To date, no one has reported a real case of a hacker exploiting the
flaw, but knowing that the possibility existed prompted several companies to instruct employees to disable
Java in their browsers.
Internet Worm
Self-propagating program brought down 10% of all internet nodes.
Browser Wars
Competition inspired Netscape and Microsoft to accelerate the schedules for their 4.0 browser releases
and resulted in a swarm of bugs, ranging from JavaScript flaws in Netscape's Communicator to a reboot
bug in Microsoft's Internet Explorer. Communicator was in Version 4.04 for Windows 95 and NT, six
months after its first release. Internet Explorer 4.01, the first of many bug-fix versions, arrived in December,
two months after the initial release of IE 4.0.
• ICL poll tax
    Company successfully sued for supply of faulty software
    ICL’s limited liability defence rejected by judge
ICL was sued by St Albans City Council for supplying flawed software. The judge ruled that the
standard liability clause contained in the firm’s contract did not apply under the Unfair Contract
Terms Act 1977. The poll tax system, supplied by ICL, overestimated the number of eligible poll
tax payers by 3,000. As a result the Council received less government funding than it was entitled
to.
Commercial
• Pepsi Cola
    Fault led to printing of 500,000 winning numbers rather than one
    Company faced a substantial liability
Pepsi Cola plants in the Philippines were besieged by angry customers after a software fault led
to a printing failure - 500,000 bottle tops were printed with the number 349, the winning number
in a Pepsi drinks promotion. Pepsi faced a liability of £11.5 billion.
• Airtours
    Air companies are completely dependent on automatic booking systems
    Failure in booking system led to loss of £5m
Leading holiday firm Airtours lost millions of pounds in sales after its booking system crashed.
The failure came at the worst possible time, coinciding with the company’s launch of its 1994
holiday brochure. One estimate put the cost of loss of sales at £5m. The failure was attributed to
an unfortunate combination of operator and software failures.
Environmental
Deregulation of California Utilities Postponed
In 1998, two new electrical power agencies charged with deregulating the California power industry
postponed their plans by at least three months. The delay let them debug the software that runs the
new power grid. Consumers and businesses were supposed to be able to choose from some 200 power
suppliers as of January 1, 1998, but time ran out for properly testing the communications system that links
the two new agencies with the power companies. The project was postponed after a seven-day
simulation of the new system revealed serious problems. The delay cost as much as $90 million--much
of which was eventually footed by ratepayers and which may have caused some of the new power
suppliers to go into debt or out of business before they started.
Concerns about Sizewell B Nuclear Power Station
• Primary Protection System (PPS) is implemented in software
• 100,000 lines of code in PPS
• Functionality:
    more sophisticated than conventional hardware approach
    provides diagnostic aids to reduce downtime
    it is configurable for different demand cycles
    automatically calibrated and tested
• Concerns over complexity and size
• Original requirements were imprecise
• Tested using dynamic and static techniques - not enough in themselves to give confidence
The Sizewell B Nuclear Power Station’s emergency shutdown system is triggered by a primary
protection system (PPS) implemented in 100,000 lines of software. Software was used to bring additional
operational advantages over hardware, i.e. lower downtime and fewer false trips.
The assurance of the system raised many technical questions and prompted some concern
about the PPS’s reliability. The important issues were:
• The software is too large and complex for any plausible claim that it is fault-free;
• The basic protection functionality is not isolated from other less critical functionality
(safety kernel approach)
• The PPS is backed-up by the Secondary protection system (SPS) but there are possible
demands which the SPS is not designed to handle
• The design and development does not seem to have used industry best practice
• The rigour and representativeness of testing activities is questionable
• The complexity would make future changes difficult and potentially dangerous
In addition to these concerns there were worries about the safety culture surrounding the project
(Channel 4 Dispatches).
Year 1900 bug
In 1992 Mary from Winona, Minnesota, received an invitation to attend a kindergarten.
Mary was 104 at the time.
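The arithmetic behind this kind of mistake is easy to reproduce. The sketch below assumes, as an illustration only, that the record system stored years as two digits and that Mary was born in 1888 (inferred from her being 104 in 1992). The function names are hypothetical.

```python
def age_from_two_digit_years(current_yy: int, birth_yy: int) -> int:
    """Age as computed by a system that keeps only the last two digits of a year."""
    return current_yy - birth_yy


def age_from_full_years(current_year: int, birth_year: int) -> int:
    """Age computed from four-digit years."""
    return current_year - birth_year


if __name__ == "__main__":
    # Assumed birth year 1888, stored as "88"; current year 1992, stored as "92".
    print("two-digit years:", age_from_two_digit_years(92, 88))
    print("four-digit years:", age_from_full_years(1992, 1888))
```

With two-digit years the century is lost, so a 104 year old is indistinguishable from a 4 year old who is about to start kindergarten.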