Lecture 4

Software Reliability
25 September 2006
About the Evening Lectures

Viewing is required





All lectures will be recorded and shown during a
regular class period
Working on getting them posted on the web so
that you can download them at other times as well
Sign-in sheet at lecture
Assignment: two paragraph summary of what
you learned
Dinner lottery
About the Midterm

Use of Blackboard



http://help.unc.edu/?id=4735&trail=4781
Installing SecureExam (see Guidelines on
home page)
Later this week, I will post a dummy exam
that you are all to take BEFORE the midterm
to ensure that everything is working properly
Simplified Model of a Computer
[Diagram: a processor connected to memory]
Processor
Control Unit: retrieves the instruction, directs data movement
Arithmetic Logic Unit: performs the operations
MEMORY
instructions: define an algorithm
data: the information that it works on
Points to Remember



Computers access information by location and
do not know the value stored there
Computers store numbers in fixed-size
packets, which means that numbers cannot grow
indefinitely
Computers do not distinguish between
different types of data (e.g., instructions or
text or numbers)
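
A minimal sketch of the last two points, assuming Python's standard struct module as a stand-in for raw memory. The four byte values are made up for illustration. The same bytes are read as text, as an integer, and as a floating-point number, and a 32-bit packet is shown refusing a value that is too large for it.

```python
import struct

# Four bytes at some memory location; nothing in them says what type they are.
raw = bytes([0x48, 0x69, 0x21, 0x00])   # hypothetical contents

as_text = raw[:3].decode("ascii")        # first three bytes read as characters
as_int = struct.unpack("<I", raw)[0]     # all four read as a 32-bit unsigned integer
as_float = struct.unpack("<f", raw)[0]   # all four read as a 32-bit float

print(as_text, as_int, as_float)

# Fixed-size packets: a 32-bit unsigned slot cannot hold 2**32.
try:
    struct.pack("<I", 2**32)
    fits = True
except struct.error:
    fits = False
print("2**32 fits in 32 bits:", fits)
```

Which reading is "right" depends entirely on what the program expects to find at that location.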
Review: Computerized Systems

Finance: banking; stock market; commerce

Medical: diagnostics; life support; medical devices


Communications: television; radio; news; networks
Transportation: traffic signals; air traffic control; aircraft; spacecraft;
trains; cars

Military: weapons systems; intelligence gathering


Energy: power plants; toxic chemical plants; oil & gas
Water: sewer

Buildings: HVAC; security; lights

Personal & household items
What is a Bug?
Bug


Problems in code that cause it to behave in an
unintended, unanticipated or unpredictable
manner
Origin

Grace Hopper (1906-1992), 1947: moth in a relay
"First actual case of bug being found."

Thomas Edison used the term in 1878:
"Bugs"—as such little faults and difficulties are called—
First Computer Bug
Why are bugs hard to find?

The error can appear in another program


Device drivers, memory management
The error may only occur occasionally

May require multiple conditions to occur
Classes of Problems






Poorly designed software
Poorly understood requirements
Poorly designed user interfaces
Improper use
Data entry problems
Simple coding errors
80% of software projects fail

50% challenged: 2x budget, 2x completion time, 2/3 of planned function
30% impaired: scrapped
Source: Standish Group, 1995
Sources of Risk
1. Lack of top management commitment
2. Lack of user commitment
3. Misunderstood requirements
4. Inadequate user involvement
5. Mismanaged user expectations
6. Scope creep
7. Lack of knowledge or skill
Keil et al., “A Framework for Identifying Software Project
Risks,” CACM 41:11, November 1998.
Can’t We Test Out the Problems?

To establish that the probability of software failure
is less than 10^-9 in 10 hours, the testing
required with one computer would take more than 1
million years
Butler and Finelli, “The Infeasibility of Experimental
Quantification of Life-Critical Software Reliability”
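
A back-of-the-envelope sketch of where a figure of that size comes from. The simplifying assumption here (not Butler and Finelli's full statistical argument) is that a failure probability p per mission cannot be supported without roughly 1/p failure-free missions of testing.

```python
HOURS_PER_YEAR = 24 * 365.25

target_failure_probability = 1e-9   # per mission, from the slide
mission_hours = 10                  # mission length, from the slide

# Assumption: about one failure is expected per 1/p missions, so at least
# that many failure-free missions are needed to support the claim.
missions_needed = 1 / target_failure_probability
test_hours = missions_needed * mission_hours
test_years = test_hours / HOURS_PER_YEAR

print(f"{test_hours:.1e} hours, about {test_years:,.0f} years on one computer")
```

Even this optimistic estimate lands above one million years, which is why testing alone cannot demonstrate that level of reliability.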

NIST estimates cost to US economy from inadequate software
testing > $59 billion/yr.
NIST Planning Report 02-3
Simple Problems

Tampa couple was billed $4,062,599.57 for a
month’s electricity



Correct bill was $146.76
Input error: clearly no adequate check for
reasonable values
High school freshman banned from football because
of drug use in middle school


Actual offense was chewing gum and being tardy
Different codes not properly translated: systems are only
as good as their weakest links
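
A minimal sketch of the kind of reasonableness check the electricity bill was missing. The function name, the typical bill, and the 20x threshold are all hypothetical. A real billing system would choose its limits from customer history.

```python
def is_plausible_bill(amount, typical, max_ratio=20.0):
    """Return True if amount is positive and at most max_ratio times typical."""
    return 0 < amount <= typical * max_ratio

typical_monthly_bill = 150.00   # hypothetical typical residential bill

for amount in (146.76, 4_062_599.57):
    ok = is_plausible_bill(amount, typical_monthly_bill)
    status = "ok" if ok else "HOLD FOR REVIEW"
    print(f"${amount:,.2f}: {status}")
```

The check does not need to be clever. It only has to stop an absurd value before it reaches the customer.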
User Interface Bug


Usability Issue
Afghanistan War (December 2001)


Use of GPS receiver to determine coordinates



Friendly fire kills 3, injures 20 when satellite-guided bomb
lands on a battalion command post
Change battery
What should come up?
www.washingtonpost.com/ac2/wp-dyn/A8853-2002Mar23
Denver Airport Baggage System
(1995)


4 years in development at a cost of $193M
The promise: baggage delivered in < 10 minutes to any part of the airport!
Massively complex system: 4000 cars, 21 miles of track, scanners, photocells, 300 computers
What happened:
Cars misrouted and crashed, baggage lost and damaged
Delayed opening cost $1.1M/day
When the airport opened a year late, only one airline used the system
www.cis.gsu.edu/~mmoore/CIS3300/handouts/SciAmSept1994.html
Denver Airport System

Examples of bugs:




Photocell could not detect bags on the belt and
therefore didn’t stop system
System had lost track of state of carts during jams
Timing between conveyor belts and carts not
properly synchronized
Overall


Not just software glitches
very complex, poorly engineered system
Ariane 5 (1996)
Software error
Integer overflow
External view
Only about 40 seconds after
initiation of the flight sequence, at an
altitude of about 3700 m, the
launcher veered off its flight path,
broke up and exploded
Cost
Development cost: $7 billion
Delay of more than one year
Lost: one set of four identical, uninsured scientific satellites + one rocket = $500,000,000
What Happened?


Overflow: tried to put too big a number into
too small a space
Even worse – the feature that caused the
problem wasn’t needed! It was only needed
to set up the launch!
archive.eiffel.com/doc/manuals/technology/contract/ariane/page.html
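
A sketch of "too big a number into too small a space", assuming the commonly reported form of the failure: a 64-bit floating-point value converted to a 16-bit signed integer. This is not the flight code, which was written in Ada. The function names and readings are hypothetical.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_checked(value):
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return n

def to_int16_saturating(value):
    """Convert to a 16-bit signed integer, clamping to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))

small_reading = 1234.5    # hypothetical value that fits
large_reading = 50000.0   # hypothetical value that does not fit

print(to_int16_checked(small_reading))
print(to_int16_saturating(large_reading))
try:
    to_int16_checked(large_reading)
except OverflowError as err:
    print("conversion failed:", err)
```

The design question is what should happen when the value does not fit: stop with an error, clamp to the limit, or something else. Leaving that question unanswered is what turns an overflow into a failure.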
Bank of New York
November 20, 1985




BoNY: nation’s largest clearer of government
securities.
Software to track Federal securities
transactions wrote new information on top
of old.
The Fed debited the bank for each transaction,
but the bank did not know who owed it how
much.
90 minutes => $32 billion overdraft!
Cost of Bug





Bank had to borrow $24 billion from the Federal
Reserve. Interest paid: ~$5 million for 1 day.
(Annual earnings of bank: ~$120 million)
BoNY share price dropped by 25¢
Federal funds rate dropped from 8.4% to 5.5%
System down for 28 hours.
Fear of financial crisis caused an increase in
the price of platinum!
Cause of bug




Message buffer counter in the BoNY system was
16 bits long.
Counters at the Fed (and other banks) were 32 bits.
More than 32,000 transactions that morning!
=> Counter overflow
Securities database corrupted.
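
A small sketch of a 16-bit counter running out of room, assuming the counter was signed (maximum 32,767), which would match the "more than 32,000" figure. The function name is hypothetical. Python's ctypes.c_int16 is used only to imitate a 16-bit storage slot.

```python
import ctypes

def next_count_16bit(count):
    """Increment a counter that is stored in a signed 16-bit slot."""
    return ctypes.c_int16(count + 1).value

count = 32_766
for _ in range(3):
    count = next_count_16bit(count)
    print(count)
```

Once the counter passes its maximum it wraps around to a negative number, so anything that uses it as an index or a record count is now pointing at the wrong place.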
The Drama continues…

While trying to correct it, they copied the corrupted
data over the backup.
Lost a few hours because of this.

Reference: Wiener, Digital Woes, 1993

Therac-25


Landmark case of how things can go terribly wrong
Medical linear accelerator: radiation therapy for
cancer patients

Used to zap tumors with high energy beams:
electron beams for shallow tissue, X-ray photons for deeper tissue
Eleven Therac-25s were installed:
six in Canada, five in the United States
Developed by Atomic Energy of Canada Limited
(AECL).
Therac-25

Improvements over Therac-20:



Uses new “double pass” technique to accelerate
electrons.
Machine itself takes up less space.
Other differences from the Therac-20:

Software now coupled to the rest of the system
and responsible for safety checks.


Hardware safety interlocks removed.
“Easier to use.”
Therac-25 Turntable
[Figure: Therac-25 turntable. Labeled parts: field light mirror, counterweight, turntable, scan magnet (electron mode), beam flattener (X-ray mode)]
1985-1987: Six known accidents



June 1985: Patient in Marietta, GA received
overdose
July 1985: Hamilton, Ontario: patient
severely burned, died that November.
December 1985: Patient in Yakima, WA received
overdose
Vernon Kidd

Early March 1986, Tyler, TX:
Receives dose > 100 times too high
Complained that he felt burned

Engineer: It’s not possible for the Therac-25 to give an
overdose.

Engineering firm: Machine does not appear capable
of giving a patient an electrical shock...

Patient died 5 months later

Machine put back in use in late March
What Went Wrong?

User Interface




Operator entered code for high energy rather than
low energy
“Malfunction message”
Operator entered “Proceed” because system was
known to give quirky errors
Result

Turntable was in the wrong position
3 Weeks Later: Ray Cox

Second accident in Tyler, Tx

Same operator

Patient died 1 month later

This time they were able to reproduce the problem
What would cause that to happen?

Race conditions: several different race condition bugs.
Overflow error: the turntable position was not checked every 256th time
the “Class3” variable was incremented.
No hardware safety interlocks.
Wrong information on the console.
Non-descriptive error messages: “Malfunction 54”, “H-tilt”.
User-overridable error modes.
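
A simplified sketch of the overflow error, assuming (as Leveson and Turner describe) that "Class3" was a one-byte variable that was incremented on each setup pass and that a value of zero meant "no need to check the turntable". The function name and the loop are hypothetical. The real software was not written in Python.

```python
def passes_that_skip_check(total_passes):
    """Return the pass numbers on which the turntable check is skipped."""
    class3 = 0
    skipped = []
    for n in range(1, total_passes + 1):
        class3 = (class3 + 1) & 0xFF   # one-byte increment: 255 wraps to 0
        if class3 == 0:                # zero is read as "nothing to check"
            skipped.append(n)
    return skipped

print(passes_that_skip_check(1000))
```

Setting the flag to a fixed nonzero value instead of incrementing it removes the wraparound, because the flag can then never return to zero by accident.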
Source of the Bug



Incompetent engineering.
Safety analysis excluded the software!
No usability testing.
Sources

Leveson, N., Turner, C. S., An Investigation of the Therac-25
Accidents. IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41.
http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html

Information for this article was largely obtained from primary sources
including official FDA documents and internal memos, lawsuit depositions,
letters, and various other sources that are not publicly available.
The authors:
Nancy Leveson
Clark S. Turner
Lots more stories

Links will be added to the references section of
the web site


http://www5.in.tum.de/~huckle/bugse.html
http://www.baddesigns.com/
Final Discussion

Should Microsoft be held responsible for the
business problems and viruses caused by
security holes in their software?