Software Reliability
25 September 2006

About the Evening Lectures
  Viewing is required
  All lectures will be recorded and shown during a regular class period
  We are working on posting them on the web so that you can download them at other times as well
  Sign-in sheet at lecture
  Assignment: two-paragraph summary of what you learned
  Dinner lottery

About the Midterm
  Use of Blackboard
    http://help.unc.edu/?id=4735&trail=4781
  Installing SecureExam (see Guidelines on home page)
  Later this week, I will post a dummy exam that you are all to take BEFORE the midterm to ensure that everything is working properly

Simplified Model of a Computer
  Processor
    Control Unit: retrieves the instruction; directs data movement
    Arithmetic Logic Unit: performs the operations
  Memory
    Instructions: define an algorithm
    Data: the information that it works on

Points to Remember
  Computers access information by location and do not know the value
  Computers store numbers in fixed-size packets, which means that they cannot grow indefinitely
  Computers do not distinguish between different types of data (e.g., instructions or text or numbers)

Review: Computerized Systems
  Finance: banking; stock market; commerce
  Medical: diagnostics; life support; medical devices
  Communications: television; radio; news; networks
  Transportation: traffic signals; air traffic control; aircraft; spacecraft; trains; cars
  Military: weapons systems; intelligence gathering
  Energy: power plants; toxic chemical plants; oil & gas
  Water: sewer
  Buildings: HVAC; security; lights
  Personal & household items

What is a Bug?
  Bug
    Problems in code that cause it to behave in an unintended, unanticipated, or unpredictable manner
  Origin
    Grace Hopper (1906-1992), 1947: moth in a relay
      "First actual case of bug being found."
    Thomas Edison used the term in 1878
      "Bugs" - as such little faults and difficulties are called -

First Computer Bug

Why are bugs hard to find?
  The error can appear in another program
    Device drivers, memory management
  The error may only occur occasionally
    May require multiple conditions to occur

Classes of Problems
  Poorly designed software
  Poorly understood requirements
  Poorly designed user interfaces
  Improper use
  Data entry problems
  Simple coding errors

80% of software projects fail
  50% challenged
    2x budget
    2x completion time
    2/3 planned function
  30% impaired
    Scrapped
  Source: Standish Group, 1995

Sources of Risk
  1. Lack of top management commitment
  2. Failure to gain user commitment
  3. Misunderstood requirements
  4. Inadequate user involvement
  5. Mismanaged user expectations
  6. Scope creep
  7. Lack of knowledge or skill
  Source: Keil et al., “A Framework for Identifying Software Project Risks,” CACM 41:11, November 1998.

Can’t We Test Out the Problems?
  To establish that the probability of failure of software is less than 10^-9 in 10 hours, the testing required with one computer is greater than 1 million years
    Butler and Finelli, “The Infeasibility of Experimental Quantification of Life-Critical Software Reliability”
  NIST estimates the cost to the US economy from inadequate software testing at > $59 billion/yr.
    NIST Planning Report 02-3

Simple Problems
  Tampa couple was billed $4,062,599.57 for a month’s electricity
    Correct bill was $146.76
    Input error: clearly no adequate check for reasonable values
  High school freshman banned from football because of drug use in middle school
    Actual offense was chewing gum and being tardy
    Different codes not properly translated: systems are only as good as their weakest links

User Interface Bug: Usability Issue
  Afghanistan War (December 2001)
  Use of GPS receiver to determine coordinates
  Friendly fire killed 3 and injured 20 when a satellite-guided bomb landed on a battalion command post
  Change battery
    What should come up?
    www.washingtonpost.com/ac2/wp-dyn/A8853-2002Mar23

Denver Airport Baggage System (1995)
  4 years in development at a cost of $193M
  The promise: baggage delivered in < 10 minutes to any part of the airport!
  Massively complex system
    4000 cars, 21 miles of track, scanners, photocells, 300 computers
  What happened: cars misrouted and crashed, baggage lost and damaged
  Delayed opening cost $1.1M/day
  When the airport opened a year late, only one airline used the system
  www.cis.gsu.edu/~mmoore/CIS3300/handouts/SciAmSept1994.html

Denver Airport System
  Examples of bugs:
    Photocell could not detect bags on the belt and therefore didn’t stop the system
    System lost track of the state of carts during jams
    Timing between conveyor belts and carts not properly synchronized
  Overall
    Not just software glitches: a very complex, poorly engineered system

Ariane 5 (1996)
  Software error: integer overflow
  External view
    Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up, and exploded
  Cost
    Development cost: $7 billion
    Delay of more than one year
    One set of four identical, uninsured scientific satellites + one rocket: $500,000,000

What Happened?
  Overflow: tried to put too big a number into too small a space
  Even worse: the feature that caused the problem wasn’t needed!
    It was only needed to set up the launch!
  archive.eiffel.com/doc/manuals/technology/contract/ariane/page.html

Bank of New York
  November 20, 1985
  BoNY: nation’s largest clearer of government securities.
  Software to track federal securities transactions wrote new information on top of old.
  The Fed debited the bank for each transaction, but the bank did not know who owed it how much.
  90 minutes => $32 billion overdraft!

Cost of Bug
  Bank had to borrow $24 billion from the Federal Reserve.
  Interest paid: ~$5 million for 1 day. (Annual earnings of the bank: ~$120 million)
  BoNY share price dropped by 25¢
  Federal funds rate dropped from 8.4% to 5.5%
  System down for 28 hours.
  Fear of a financial crisis caused an increase in the price of platinum!

Cause of Bug
  Message buffer counter in the BoNY system was 16 bits long.
    Counters at the Fed (and other banks) were 32 bits.
  More than 32,000 transactions that morning! => Counter overflow
  Securities database corrupted.

The Drama Continues…
  Trying to correct it, they copied corrupted data over the backup.
  Lost a few hours because of this.
  Reference: Wiener, Digital Woes, 1993

Therac-25
  Landmark case of how things can go terribly wrong
  Medical linear accelerator: radiation therapy for cancer patients
    Used to zap tumors with high-energy beams
      Electron beams for shallow tissue
      X-ray photons for deeper tissue
  Eleven Therac-25s were installed:
    Six in Canada
    Five in the United States
  Developed by Atomic Energy of Canada Limited (AECL).

Therac-25
  Improvements over Therac-20:
    Uses new “double pass” technique to accelerate electrons.
    Machine itself takes up less space.
  Other differences from the Therac-20:
    Software now coupled to the rest of the system and responsible for safety checks.
    Hardware safety interlocks removed.
    “Easier to use.”

Therac-25 Turntable
  [Figure: turntable with field light mirror, counterweight, scan magnet (electron mode), and beam flattener (X-ray mode)]

1985-1987: Six known accidents
  June 1985: Patient in Marietta, GA received an overdose
  July 1985: Hamilton, Ontario: patient severely burned, died that November.
  December 1985: Patient in Yakima, WA received an overdose

Vernon Kidd
  Early March 1986, Tyler, TX: received a dose > 100 times too high
    Complained he felt burned…
    Engineer: It’s not possible for Therac-25 to give an overdose.
    Engineering firm: Machine does not appear capable of giving a patient an electrical shock...
    Died 5 months later
    Machine put back in use late March

What Went Wrong?
  User interface
    Operator entered the code for high energy rather than low energy
    “Malfunction message”
    Operator entered “Proceed” because the system was known to give quirky errors
  Result
    Turntable was in the wrong position

3 Weeks Later: Ray Cox
  Second accident in Tyler, TX
  Same operator
  Patient died 1 month later
  This time they were able to reproduce the problem

What would cause that to happen?
  Race conditions.
    Several different race condition bugs.
  Overflow error.
    The turntable position was not checked every 256th time the “Class3” variable was incremented.
  No hardware safety interlocks.
  Wrong information on the console.
  Non-descriptive error messages.
    “Malfunction 54”
    “H-tilt”
  User-overridable error modes.

Source of the Bug
  Incompetent engineering.
  Safety analysis excluded the software!
  No usability testing.

Sources
  Leveson, N., Turner, C. S., An Investigation of the Therac-25 Accidents. IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41.
  http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
  Information for this article was largely obtained from primary sources including official FDA documents and internal memos, lawsuit depositions, letters, and various other sources that are not publicly available.
  The authors: Nancy Leveson, Clark S. Turner

Lots more stories
  Links will be added to the references section of the web site
  http://www5.in.tum.de/~huckle/bugse.html
  http://www.baddesigns.com/

Final Discussion
  Should Microsoft be held responsible for the business problems and viruses caused by security holes in their software?
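
Illustration: fixed-size counters
  The overflow bugs in the Bank of New York and Therac-25 slides both come from storing a count in a fixed-size packet. The sketch below is only an illustration, not the code from either system. The helper names wrap_unsigned and wrap_signed are hypothetical, the 8-bit width of “Class3” is inferred from the "every 256th time" remark, and treating the 16-bit BoNY counter as signed is an assumption suggested by the "more than 32,000 transactions" figure.

```python
def wrap_unsigned(value: int, bits: int) -> int:
    """Keep only the low `bits` bits, as a fixed-width unsigned register would."""
    return value & ((1 << bits) - 1)


def wrap_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits as a two's-complement signed number."""
    low = wrap_unsigned(value, bits)
    sign_bit = 1 << (bits - 1)
    return low - (1 << bits) if low & sign_bit else low


# Assumed signed 16-bit counter: the largest value it can hold is 32,767.
last_good = wrap_signed(32_767, 16)
after_one_more = wrap_signed(32_767 + 1, 16)

# Assumed 8-bit counter: the 256th increment brings it back to zero,
# so a test of the form "if counter != 0: run the check" is skipped.
counter = 0
for _ in range(256):
    counter = wrap_unsigned(counter + 1, 8)

print(last_good, after_one_more, counter)
```

  Python integers grow as needed, so the masking is what imitates the hardware limit here; in a language with fixed-width integer types the wraparound happens without any explicit code.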