Reliability and Safety

Week 12
What can go wrong?
Issues:
Hardware Errors
Software Errors
Fault vs Error
Computer failure causes:
• Faulty design
• Sloppy implementation
• Careless or insufficiently trained users
• Poor user interfaces
• Hardware/software malfunctions
• Specification errors
• Scope/application inconsistency
Computer User's Perspective
Should understand the limitations of computers
Need for proper training
Need for responsible use
Difference between good products and bad ones
Computer Professional's Perspective
Study computer failures
Study computer ethics
Educated Member of Society's Perspective
Helps us evaluate the reliability and safety of various computer applications
Helps us evaluate computer technology
Three Categories of Failures
Problems for individuals
System failures that affect large numbers of people or cost large amounts of money
Problems in safety-critical applications
Problems for Individuals
• Billing errors
• Flawed design and/or implementation of programs
• Not enough care: input errors
• Not enough testing of whether amounts fall in a reasonable range
• Not enough training
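A reasonable-range test like the one mentioned above can be sketched as follows. This is a minimal illustration, not any particular billing system: the `Bill` type, the `check_bill` function, and the two limits are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical limits; a real system would derive these from billing history.
MIN_REASONABLE = 0.00
MAX_REASONABLE = 5_000.00


@dataclass
class Bill:
    account: str
    amount: float


def check_bill(bill: Bill) -> list[str]:
    """Return a list of problems; an empty list means the bill looks plausible."""
    problems = []
    if not bill.account.strip():
        problems.append("missing account identifier")
    if bill.amount < MIN_REASONABLE:
        problems.append("negative amount")
    elif bill.amount > MAX_REASONABLE:
        problems.append("amount above reasonable range; hold for human review")
    return problems


if __name__ == "__main__":
    for bill in (Bill("A-100", 82.50), Bill("A-101", 6_300_000.00)):
        print(bill.account, check_bill(bill) or "ok")
```

The point of the design is that an implausible bill is held for a person to look at rather than mailed automatically.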
Database Accuracy Problems
• Information in the database is not accurate
• When information is entered automatically, mistakes can be overlooked
• Copies of incorrect information can be sent to other systems
• Users are not knowledgeable enough about the system
Causes
• Large population
• Most of our financial interactions are with strangers
• Automated processing without human common sense
• Overconfidence in accuracy of data
• Lack of accountability
Consumer Hardware and Software
• Usually have more serious errors in their first releases
• Regularly sold with known bugs
• Hardware also has flaws
• Tradeoff between cost, debugging, and marketing
• Dishonesty, denial of problems, and lack of adequate response to complaints
System Failures
• Large financial losses
• Complete shutdown of basic services
• Areas:
  • Communications
  • Business and financial systems
  • Military
WHY?
Not enough testing
Technical difficulties
Poor management decisions
Dishonesty in promoting the system and responding to problems
Communications
• Phone service
• How bad?
  • Pagers
  • Phone calls
  • 911
  • Communications for airports
  • Cellular phones
Business and Financial Systems
Stock exchange
ATM
Contest by Pepsi
• Too many winning tickets issued
Destroying Business
Loss of sales
Incorrect information affects business
Dissatisfied customers
Incorrect prices
Loss of data
Military
• Data management
• Weapons system design
• Battle simulation
• Battle management
  • Command/control
  • Communications
  • Intelligence
• Nuclear war
Why?
Not enough testing
Technical difficulties
Poor management decisions
Dishonesty in promoting the system and responding to problems
• Results in delays and abandonment of projects
• Heard before?
The Denver Airport Baggage System
• Outbound luggage checked at ticket counters or curbside
• To be delivered anywhere in under 10 minutes
• Via an automated system of cars on tracks
• Connecting flights or terminals
• Laser scanners
• Tracks carrying 4000 cars
Problems Encountered
Cars crashed into each other at intersections
Luggage was misrouted, dumped, or flung
Needed cars were idle or put to rest
Specific problems
Real-world problems
• Scanners got dirty
• Scanners were knocked out of alignment
Software error
• Cars were rerouted to a waiting area and left idle
Causes
• Time allowed for development and testing was insufficient
• Significant changes in specifications were made after the project began
• Not enough debugging time
• Poor management
• Unrealistic plan
Safety-Critical Applications
• Use of computers is increasing rapidly in these areas
• Use of computers in these areas can save money
• Areas
  • Military
  • Medical applications
  • Power plants
  • Aircraft
  • Trains
Aircraft - Fly by Wire
• Pilots do not directly control the plane
• Actions are input to computers that control the aircraft systems
• Pilot interaction is critical
• Need for an easy way to override the computers
• Easy transfer between automatic and manual control
Air Traffic Control
Long delays
Increased risk of collisions
Old machines and computer systems
Political: government spends money elsewhere
Case Study - Therac-25
• Software-controlled radiation therapy machine used to treat people with cancer
• Problems:
  • Massive overdoses administered
  • Repeated overdoses due to faulty display
  • Death
• Dual-mode machine: electron beam or X-ray photon beam
Why?
• Lapses in good safety design
• Insufficient testing
• Bugs in the software that controlled the machines
• Inadequate system of reporting and investigating accidents and deaths
Specific problems
• Some hardware safety features were eliminated in newer models
• Software reused from older systems was assumed to be correct
• Machine malfunctioned frequently
• Weakness in the design of the operator interface
• Inadequate explanation of error messages, if any
Specific problems continued
Machine allowed one-key intervention to resume instead of shutting down automatically
Inadequate documentation
Poor test plan
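The contrast between one-key intervention and automatic shutdown can be sketched as a small state machine. This is a hypothetical controller written for illustration only, not the actual Therac-25 logic; the `Machine` class, its states, and its method names are all assumptions.

```python
from enum import Enum, auto


class State(Enum):
    READY = auto()
    TREATING = auto()
    SHUTDOWN = auto()


class Machine:
    """Hypothetical controller, not the actual Therac-25 logic."""

    def __init__(self) -> None:
        self.state = State.READY
        self.fault_log: list[str] = []

    def report_fault(self, description: str) -> None:
        # Fail safe: every fault is logged with a readable description
        # and the machine shuts down instead of pausing.
        self.fault_log.append(description)
        self.state = State.SHUTDOWN

    def operator_proceed(self) -> bool:
        """The operator's 'proceed' key is refused after a shutdown."""
        if self.state is State.SHUTDOWN:
            return False
        self.state = State.TREATING
        return True

    def service_reset(self, fault_investigated: bool) -> bool:
        """Only a reset that follows an investigation returns to READY."""
        if self.state is State.SHUTDOWN and fault_investigated:
            self.state = State.READY
            return True
        return False
```

In this sketch a fault cannot be dismissed with a single keypress, and the fault log is kept after the reset so the incident can still be investigated later.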
Software Errors - bugs
• Fatal error was a simple fix
• Fixes can be complex and expensive, and they prevent use of the machine while they are made
• Bugs
  • Can be intermittent and hard to detect
  • Importance of self-checking
  • Importance of using good programming techniques
Overconfidence
Leaving out changes that are necessary
Ignoring error messages
Not using backup devices (video or audio)
Conclusion and Perspective
• Irresponsibility leads to criminal charges
• Responsibility leads to merit awards
• Importance of good software development
• Consequences of carelessness, cutting corners, unprofessional work, or attempts to avoid responsibility
• Lack of appreciation for risks
• Poor training
Ways to prevent problems
Good computer systems
Good training
Accountability
Individual responsibility
Management responsibility
IEEE Code of Ethics
Increasing Reliability and Safety
What goes wrong?
• Many lines of code and many programmers
• Problems are managerial, technical, social, legal, ethical
Overconfidence
Unappreciative of risks
Ignore warnings
Don’t consult manuals
Professional Techniques
• Use good software engineering techniques at all stages of development:
  • Requirements
  • Specifications
  • Design
  • Implementation
  • Documentation
  • Testing
Professional Techniques
Study the techniques and tools available
Know or learn enough about the application field and the software or systems being used
Why Study Failures?
Provides technical lessons
Leads to improved hardware and software products
Provides ethical data
Leads to improved ethical codes and laws
Lessons Learned
• Accidents are not the result of unknown scientific principles but rather of a failure to apply well-known engineering practices
• Accidents will not be prevented by technological fixes alone; prevention requires control of all aspects of the development and operation of the system
Lessons Learned
Software developers need to recognize the limitations of software and use hardware safety mechanisms
Redundancy and Self-checking
• Redundancy - judging - expensive
• Complex systems collect information to diagnose and correct errors
• Audit trails are vital
• Detailed records help protect against theft and help trace and correct errors
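One common way to judge between redundant units is majority voting. The slides do not say which scheme is meant, so the sketch below is an assumption; `majority_vote` and `disagreeing_units` are hypothetical helper names.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(readings: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of units, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def disagreeing_units(readings: Sequence[Hashable], agreed: Hashable) -> list[int]:
    """Indexes of units whose reading differs from the agreed value."""
    return [i for i, reading in enumerate(readings) if reading != agreed]


if __name__ == "__main__":
    readings = [10, 10, 12]            # three redundant units, one faulty
    agreed = majority_vote(readings)
    print(agreed, disagreeing_units(readings, agreed))
```

The expense noted above shows up directly: three units do the work of one, and the vote only helps if their failures are independent.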
Redundancy and Self-checking
• Designed to constantly monitor itself and correct problems automatically
• Half of the computing power is devoted to checking the rest for errors
  • Closes off part of the system
  • Reroutes
  • Corrects problems and reroutes again
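The monitor, close off, and reroute cycle above can be sketched in a few lines. This is a toy model under stated assumptions: the `Network` class, its route names, and the `is_healthy` callback are hypothetical and stand in for whatever checks a real system performs.

```python
class Network:
    """Toy self-checking system with interchangeable routes."""

    def __init__(self, routes):
        # routes: names of interchangeable paths, in order of preference
        self.routes = list(routes)
        self.closed = set()

    def self_check(self, is_healthy):
        """Close off routes that fail the check; reopen ones that pass again."""
        for route in self.routes:
            if is_healthy(route):
                self.closed.discard(route)
            else:
                self.closed.add(route)

    def pick_route(self):
        """Return the most preferred open route, or None if all are closed."""
        for route in self.routes:
            if route not in self.closed:
                return route
        return None
```

A usage example: after `self_check` finds route "A" unhealthy, `pick_route` returns "B"; once a later check passes, traffic is rerouted back to "A".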
TESTING
CRITICAL!
Principles and techniques exist
Another company can be used to perform independent verification and validation
Dangerous Tendencies
• Operators
  • Bypass check mechanisms through familiarity
• Technicians
  • Blame random mechanical or signal glitches rather than software
• Corporate managers
  • Initially deny and ignore, then cover up
  • Finally deal with expensive fixes
Overall Lessons Learned
• Should not declare a problem understood with the first hypothesis
• Should not expect management to follow through on field reports
• Overconfidence in software leads to economical, marginal designs
Overall Lessons Learned
Enforcement of software engineering practices is often abysmal
Basing risk assessments on individual subsystems often leads to unrealistic optimism
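A short calculation shows why subsystem-level risk estimates look optimistic. The sketch assumes independent failures and a system that needs every subsystem to work (a series system); the figures are illustrative, not taken from any real assessment.

```python
import math


def system_reliability(subsystem_reliabilities):
    """Probability that every subsystem works, assuming independent failures
    and that the system needs all of them (a series system)."""
    return math.prod(subsystem_reliabilities)


if __name__ == "__main__":
    parts = [0.99] * 20          # twenty subsystems, each 99% reliable
    print(f"{system_reliability(parts):.3f}")   # prints 0.818
```

Each subsystem looks excellent in isolation, yet under these assumptions the whole system fails almost one time in five.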
Lessons for Systems Engineering
• Hardware backups are valuable
• Software must not be presumed innocent
• Audit trails are critical
• Risk estimates are subjective
• User feedback is valuable
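An audit trail, as recommended in the list above, can be as simple as an append-only record of who did what to which item. The sketch below is a minimal illustration; the `AuditTrail` class and its field names are hypothetical.

```python
import json
import time
from typing import Optional


class AuditTrail:
    """Append-only record of changes; entries are never edited or deleted."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def record(self, user: str, action: str, target: str,
               timestamp: Optional[float] = None) -> None:
        self._entries.append({
            "time": time.time() if timestamp is None else timestamp,
            "user": user,
            "action": action,
            "target": target,
        })

    def history(self, target: str) -> list[dict]:
        """All recorded actions on one item, oldest first."""
        return [entry for entry in self._entries if entry["target"] == target]

    def dump(self) -> str:
        """One JSON object per line, suitable for writing to a log file."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._entries)
```

Calling `history("acct-7")` after an error is found shows every recorded change to that item, which is what makes tracing and correcting the error possible.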
Lessons for Software Engineering
• Documentation should be ongoing
• Designs should be kept simple
• Testing should be built into software
• Software must be tested both outside the system and within it
• Reused software should be tested like new software
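Built-in testing and retesting of reused code can be sketched as follows. The `average_speed` helper and its `self_test` routine are hypothetical examples; the idea is that the reused routine carries its own checks and is exercised again inside the new system rather than trusted because it worked in the old one.

```python
def average_speed(distance_km: float, hours: float) -> float:
    """Reused helper with its checks built in rather than assumed."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    result = distance_km / hours
    # Built-in self-check: the result must be consistent with the inputs.
    assert abs(result * hours - distance_km) <= 1e-9 * max(1.0, distance_km)
    return result


def self_test() -> bool:
    """Run at start-up in the new system, not only in the old one."""
    try:
        average_speed(10.0, 0.0)
    except ValueError:
        rejected_bad_input = True
    else:
        rejected_bad_input = False
    return rejected_bad_input and average_speed(100.0, 2.0) == 50.0
```

Running `self_test()` when the system starts gives an in-system check, while the same function can also be called from a separate test suite outside the system.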
Lessons for oversight
Users are more likely to make initial observations than monitoring officials
Users need reliable information in order to be maximally valuable
Laws and Regulations
• Criminal and civil penalties
  • Suits against the company that designs or sells the system
  • Criminal charges when fraud or criminal negligence occurs
• Need contracts
• Need well-designed laws and standards
Regulation
• Requirement for approval by a government agency before a new product can be sold
  • Including specific testing requirements
• The profit motive causes skimping on safety
• Better to abandon in some cases
• Customers lack the ability to judge safety for themselves
• Hard to sue large companies
Regulation
Expensive and time-consuming
Newer procedures may not
be enforced
Lots of paperwork
Professional licensing
• Licensing of software development professionals to protect against poor quality and unethical behavior
  • Specific training
  • Passing a competency exam
  • Ethical requirements
  • Continuing education