Nancy G. Leveson
Clark S. Turner
IEEE, 1993
Presented by Jack Kustanowitz
April 26, 2005
University of Maryland
What happened
Accident history
Development history
Technical problems
Company responses
Lessons learned
Ethical questions
Resources
Between June 1985 and January 1987, there were six known accidents involving massive overdoses, causing deaths and serious injuries
June 3 1985 : First overdose
July-Dec 1985 : Two more overdoses, patient sues AECL and hospital, two requests for modifications
Jan-Feb 1986 : Denial of possibility of overdose
Mar-Apr 1986 : Two more overdoses, software blamed
May-Dec 1986 : FDA declares Therac-25 defective, CAPs (Corrective Action Plans) sent back and forth between FDA and AECL. First Therac-25 user group meeting.
Jan 1987 : Sixth overdose
Feb-July 1987 : More CAPs back and forth until fifth revision of CAP sent to FDA
Nov 1988 : Final safety analysis report issued
Grueling first-hand descriptions of what it felt like to get a massive radiation overdose
Therac-6: 6 MeV accelerator for x-rays
Therac-20: 20 MeV dual-mode (x-rays or electrons)
Separate hardware interlocks
Therac-25: 25 MeV dual-mode
All safeguards done in software
Testing
“Unit and software testing was minimal, with most effort directed at the integrated system test”
Software written in assembly on a PDP-11
At first, the operator had to enter treatment information at the treatment table and then re-enter it at a console in the control room
Operators complained; safeguard was removed
Error codes are reported on the screen with no English explanation
Example: (East Texas Cancer Center) “Malfunction 54” reported, caused by “dose input 2”. An AECL technician testified that “dose input 2” means the dose delivered was either too high or too low (!)
“Treatment Pause” after non-critical error, which operator can ignore by pressing “P”
Causes operators to become insensitive to errors
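As a rough illustration of the error-code problem above, the sketch below pairs each numeric code with a plain-English message. The table name `MALFUNCTION_TEXT`, the function `describe_malfunction`, and the fallback wording are hypothetical; only the meaning of code 54 comes from the testimony quoted on this slide.

```python
# Hypothetical lookup table. Only the entry for code 54 comes from the slide;
# a real machine would need every code documented by the manufacturer.
MALFUNCTION_TEXT = {
    54: "dose input 2: delivered dose was either too high or too low",
}


def describe_malfunction(code):
    """Return a message the operator can act on instead of a bare number."""
    text = MALFUNCTION_TEXT.get(code)
    if text is None:
        # Assumption: an undocumented code is treated as serious, not ignorable.
        return f"MALFUNCTION {code}: unknown code, do not resume treatment"
    return f"MALFUNCTION {code}: {text}"


print(describe_malfunction(54))
print(describe_malfunction(99))
```

The design point is only that the screen text should tell the operator what went wrong; how each code should be classified is outside what the slide states.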
Data Entry Bug
Setting the bending magnets takes 8 seconds
“Delay” subroutine uses shared memory with the data entry subroutine
So data changes within 8 seconds will be wiped out when Delay exits!
Causes bugs that only show up with proficient users who do data entry in <8 seconds
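The data entry bug above can be modelled with a toy simulation. This is a minimal sketch and not the original PDP-11 assembly: the function `run_setup`, the mode names, and the dictionary standing in for shared memory are all hypothetical. It assumes, as the slide describes, that an edit made during the 8-second magnet delay is overwritten when the delay exits, and (an added assumption) that an edit made after the delay is handled correctly.

```python
MAGNET_SETUP_SECONDS = 8  # figure taken from the slide


def run_setup(initial_mode, edited_mode, edit_at_seconds):
    """Return (mode shown on screen, mode the machine is actually set to)."""
    shared = {"mode": initial_mode}   # memory shared by data entry and Delay
    snapshot = shared["mode"]         # value held when magnet setup begins

    # The operator's edit always reaches the screen and the shared memory.
    shared["mode"] = edited_mode
    screen = edited_mode

    if edit_at_seconds < MAGNET_SETUP_SECONDS:
        # Fast operator: Delay is still running and, on exit, writes back
        # the stale value, silently wiping out the edit.
        shared["mode"] = snapshot

    return screen, shared["mode"]


print(run_setup("xray", "electron", edit_at_seconds=3))
print(run_setup("xray", "electron", edit_at_seconds=10))
```

In the first call the screen and the machine disagree, which is the dangerous state; in the second call they match because the edit arrived after the delay finished.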
Set-Up Test Bug
On every 256th pass through Set-Up (one-byte counter), the upper collimator is not checked
Problem if operator hits “set” exactly when counter rolls over to 0
These kinds of bugs are notoriously difficult to track down
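The one-byte counter rollover above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the function name `passes_that_skip_check` is hypothetical, and the sketch assumes that a counter value of zero is read as "no check needed", which is why the wraparound matters.

```python
def passes_that_skip_check(total_passes):
    """List the Set-Up passes on which the collimator check would be skipped."""
    counter = 0
    skipped = []
    for pass_number in range(1, total_passes + 1):
        counter = (counter + 1) & 0xFF   # one-byte counter wraps from 255 to 0
        if counter == 0:
            # Zero is treated as "nothing to check", so this pass is unguarded.
            skipped.append(pass_number)
    return skipped


print(passes_that_skip_check(600))  # → [256, 512]
```

Using a counter as a flag is what creates the hole. Storing a plain true/false value, or any fixed nonzero value, would avoid the wraparound.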
Denial
“We did not believe that there could have been any accelerator malfunction”
Incremental, local band-aid fixes
Example: “P” key removed to prevent operators from ignoring warnings
Dragging feet, doing the minimum of what the FDA requested
Perhaps justified? See ethics discussion…
Knee-jerk responses – fix the bugs as they are reported
Difficulty reproducing bugs (that only happened once in several hundred runs)
Focusing on particular software bugs is not the way to make a safe system
Assumption that fixing one error would prevent further accidents
“There is always another software bug”
It is a bad idea to remove independent hardware interlocks, and to believe too much in software
Assume software will fail, and handle that properly, rather than trying to write “perfect” software
Don’t believe in numerical claims
“Risk assessment can be like the captured spy: if you torture it long enough, it will tell you anything you want to know”
Record the reasons for design decisions (like duplicate data entry)
Design for the worst case
Don’t enhance usability at the expense of safety
Power of user groups to cause change when companies drag their feet
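The lesson above about independent interlocks and assuming that software will fail can be sketched as a simple rule: the beam is permitted only when two independent checks agree. The function `beam_permitted` and its parameters are hypothetical, and a real hardware interlock is a physical circuit rather than a boolean; the code only models the decision logic.

```python
def beam_permitted(software_says_safe, hardware_interlock_closed):
    """Allow the beam only if BOTH independent checks report a safe state."""
    return software_says_safe and hardware_interlock_closed


# Assumption for illustration: a software bug wrongly reports "safe".
buggy_software_verdict = True
hardware_interlock_closed = False   # the independent sensor disagrees

print(beam_permitted(buggy_software_verdict, hardware_interlock_closed))
```

With the interlock removed, the buggy software verdict alone would decide the outcome. With it in place, a single software fault is not enough to turn the beam on.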
Documentation should not be an afterthought
Establish QA practices & standards
Keep designs simple
Design audit trails and logging from the beginning
Perform extensive testing and formal analysis at the module and software level, rather than relying on system-level testing
Summary of this course!
500 patients treated in East Texas before first serious accident
Too much government oversight slows progress
If 1 person was getting hurt for every 1000 helped, would you take the machine out of use? How about 1:100? 1:10000?
Where’s the line?
http://www.technology.niagarac.on.ca/courses/ctec1435/notes/Therac-25/SouthPark/01.htm