Computers in Society
Lecture 7, part 2: Software Reliability

Assignment for Next Time
Today we will talk about software reliability and responsibility for software faults. I will focus on a case study of the Therac-25, a radiation treatment machine. The Therac-25 suffered from software faults that caused it to give patients large overdoses of radiation. Some patients died as a result of the overdoses. The faults were difficult to find, so the machine remained in use even after some patients had been seriously harmed.

Assignment for Next Time (2)
For Wednesday’s class session, I want each group to discuss the Therac-25 case, answer some questions, and come to class prepared to share your answers. You should address the following questions:
• Many people involved with the Therac-25 made mistakes or bad decisions. Who made them? How much responsibility does each such person (or group of persons) bear for the injuries and deaths that were caused?

Assignment for Next Time (3)
• If you were the chairman of AECL, the company that produced the Therac-25, what would you have done after the software bug was understood? How would you have compensated the victims? How would you have changed your organization as a result?
• How would you use your knowledge of the Therac-25 case to organize software development to minimize the chances that such a situation could happen again?

Assignment for Next Time (4)
On Wednesday each group will have 5 to 10 minutes to explain their answers to the rest of the class. Please try to keep your answers and justifications succinct so we have time for discussion.

The Therac-25
The Therac-25 was a linear accelerator used to treat cancer with radiation. It was built by Atomic Energy of Canada Limited (AECL). It could deliver either X-rays or electron radiation. The Therac-25 was a next-generation machine based on two earlier machines that AECL had built in cooperation with CGR, a French company.
The earlier machines were the Therac-6 and the Therac-20.

The Therac-25 (2)
The Therac-6 and Therac-20 incorporated a computer (a PDP-11) as a front end only. The linear accelerators could be operated independently of the computer. All safety features were built into hardware. The Therac-25 integrated the computer with the linear accelerator.

The Therac-25 (3)
Two important changes created the possibility for problems: First, the Therac-25 reused code from the earlier machines. Second, some hardware safety features were replaced by software features. In machines with both electron and X-ray modes (dual-mode accelerators), a turntable rotates the needed equipment into position to give proper doses of radiation. In older machines the turntable position was checked with hardware interlocks. The Therac-25 checked it in software.

The Therac-25 (4)
The Therac-25 went into service in 1983. Eleven systems were delivered. Problems began occurring in June 1985.

Therac-25 Problem History
Accident 1: Marietta, Georgia, June 1985. Kennestone Regional Oncology Center (KROC)
• A patient was burned by treatment and suffered crippling injuries. KROC contacted AECL and asked if the Therac-25 could have failed to diffuse the radiation beam. AECL said no.
• The patient sued AECL and the hospital in October 1985.

Therac-25 Problem History
Accident 2: Hamilton, Ontario, July 1985. Ontario Cancer Foundation.
• A patient was burned during treatment. The machine shut down during treatment. The display indicated that no treatment had been given. The operator tried to proceed with treatment multiple times until the machine suspended treatment. The patient complained of being burned, and was hospitalized for radiation overdose three days later.
• The patient died of cancer in November 1985.

Therac-25 Problem History
First AECL Investigation: July-September 1985
• AECL sent an engineer to investigate after the Ontario overdose.
• The engineer discovered design problems related to a microswitch.
• AECL introduced hardware and software changes to fix the microswitch problem.
• A Canadian regulatory board requested a redesign of the handling of malfunction conditions, but AECL did not comply.

Therac-25 Problem History
Accident 3: Yakima, Washington, December 1985. Yakima Valley Memorial Hospital.
• A patient developed a pattern of striped burns as a result of treatment. The hospital staff suspected that the pattern came from slots in the accelerator’s blocking trays.
• AECL claimed that neither the Therac-25 nor operator error could have produced the damage. AECL also claimed that no similar accidents had been reported.
• The patient survived, though she was left with scarring and a mild disability.

Therac-25 Problem History
Accident 4: Tyler, Texas, March 1986. East Texas Cancer Center (ETCC).
• During a treatment session the operator noticed she had entered an “X” (for X-ray) instead of an “E” (for electron) on the data-entry screen. She quickly fixed the problem by moving the cursor to the field in error, changing it, and moving the cursor back to the bottom of the screen. The system was designed to treat input as complete when the cursor was in the bottom right position.

Therac-25 Problem History
Accident 4 continued:
• Once the operator was ready, she started treatment. After a few seconds the system shut down and gave the error code “Malfunction 54”. The operator continued treatment.
• The patient, who had had eight previous treatments, knew something was wrong because of the pain he experienced.
• The patient died five months later of a radiation overdose of between 80 and 100 times the prescribed dose.

Therac-25 Problem History
Second AECL Investigation: March 1986.
• ETCC shut down its Therac-25 after the accident and notified AECL.
• AECL sent two engineers, who were unable to duplicate the problem. They claimed it was impossible for the Therac-25 to overdose a patient. They blamed the problems on the hospital’s electrical system.
• ETCC found no problems with the electrical system, and put the Therac-25 back into service.

Therac-25 Problem History
Accident 5: Tyler, Texas, April 1986. East Texas Cancer Center (ETCC).
• This accident was virtually the same as accident 4. The same operator was at the controls, and made the same change. The same behavior and “Malfunction 54” occurred. ETCC shut down the machine and contacted AECL.
• The patient received a massive overdose of radiation to his brain and died three weeks later.
• After this incident, investigators were able to duplicate it, and the first major software bug was detected.

Therac-25 Problem History
Therac-25 Declared Defective: May 2, 1986.
• On May 2, 1986, the US Food and Drug Administration (FDA) declared the Therac-25 to be defective.
• AECL was required to notify all Therac-25 customers.
• To gain back FDA approval, AECL had to show how it would make the Therac-25 safe.

Therac-25 Problem History
Accident 6: Yakima, Washington, January 1987. Yakima Valley Memorial Hospital.
• A second patient developed a pattern of striped burns as a result of treatment.
• The hospital staff was able to match the burn marks to the slots in the Therac-25’s blocking tray.
• The patient died three months later.

Therac-25 Problem History
Therac-25 Declared Defective: February 1987.
• On February 10, 1987, the US Food and Drug Administration (FDA) declared the Therac-25 to be defective. It recommended that all machines be shut down.
• To gain back FDA approval, AECL had to show how it would make the Therac-25 safe.

Therac-25 Problem History
Therac-25 Declared Defective: February 1987.
• It took five months and five plans to receive FDA approval. The final plan included hardware interlocks to prevent overdoses or activating the beam when the turntable was not in the correct position.
• No accidents have been reported since.
Therac-25 Software Bugs
One bug occurred because the system detected the end of data entry when the cursor moved to the bottom right of the entry screen. At that point the magnets for directing the beam would be positioned, which took a few seconds. After the magnets were positioned, the cursor was checked again. If it was at the bottom of the screen, the system assumed that no changes had been made.

Therac-25 Software Bugs (2)
If a fast-typing operator such as the one in Texas made a change while the magnets were moving and then restored the cursor to the bottom of the screen, the change would show on the screen. However, the system would see the cursor at the bottom and not check for the change. The previous (mistaken) data would be used. This is an example of a race condition.

Therac-25 Software Bugs (3)
A second race condition produced the overdoses at the Yakima center. It occurred when the machine was moving the gun into position. A variable was supposed to be zero if the beam was in position to fire. Any other value meant that the beam should not fire. While the beam was not in position, the variable would be incremented steadily.

Therac-25 Software Bugs (4)
The incrementing counter could only hold values from 0 to 255. That meant that every 256th increment reset it to zero. If the operator pressed the button to fire the beam at the moment the value wrapped from 255 to zero, the beam would fire even if it was not in position. This was a rare occurrence, but it could occur, and did on two occasions.
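
To make the two bugs concrete, here is a small Python sketch. It is not the Therac-25 code; it is a simplified, single-threaded model of the behavior described on the slides above. Every name in it (EntryScreen, mode_used_for_treatment, increment_flag, beam_may_fire, unsafe_passes) is hypothetical and chosen only for illustration. The first part models the data-entry check that looks at the cursor position rather than the data. The second part models the one-byte variable that wraps from 255 back to zero.

```python
from dataclasses import dataclass


@dataclass
class EntryScreen:
    """Hypothetical stand-in for the operator's data-entry screen."""
    mode: str               # "X" (X-ray) or "E" (electron), as displayed
    cursor_at_bottom: bool  # True once the cursor is in the final position


def mode_used_for_treatment(screen, edit_during_setup=None):
    """Return the mode the modeled machine acts on.

    edit_during_setup, if given, is a correction the operator types while
    the magnets are still being positioned.
    """
    # Cursor at the bottom: entry is treated as complete and the mode is read.
    latched_mode = screen.mode

    # Magnet positioning takes a few seconds; a fast operator can edit now.
    if edit_during_setup is not None:
        screen.cursor_at_bottom = False   # cursor moves up to the field
        screen.mode = edit_during_setup   # the display shows the new value
        screen.cursor_at_bottom = True    # cursor returns to the bottom

    # Flawed re-check: only the cursor position is examined, not the data.
    if screen.cursor_at_bottom:
        return latched_mode               # the stale value is used
    return screen.mode                    # entry still open: re-read it


def increment_flag(flag):
    """Increment a one-byte flag; 255 wraps around to 0."""
    return (flag + 1) & 0xFF


def beam_may_fire(flag):
    """The check as described in the lecture: zero means 'in position'."""
    return flag == 0


def unsafe_passes(num_passes):
    """List the passes on which the check succeeds although the equipment
    is out of position for the whole run."""
    flag = 0
    unsafe = []
    for n in range(1, num_passes + 1):
        flag = increment_flag(flag)       # still not in position
        if beam_may_fire(flag):
            unsafe.append(n)
    return unsafe


screen = EntryScreen(mode="X", cursor_at_bottom=True)
used = mode_used_for_treatment(screen, edit_during_setup="E")
print(screen.mode, used)    # the display shows E, the machine acts on X
print(unsafe_passes(600))   # [256, 512]
```

In this model the screen and the machine disagree after a quick edit, and the position check passes on every 256th increment even though nothing has moved into place. Both failures depend on timing, which helps explain why the engineers had trouble reproducing them.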