CS 501: Software Engineering
Lecture 22
Reliability II
Administration
Software Reliability
Failure: Software does not deliver the service expected by
the user (e.g., mistake in requirements)
Fault (BUG): Programming or design error whereby the
delivered system does not conform to specification
Reliability: Probability of a failure occurring in operational
use.
Perceived reliability: Depends upon:
user behavior
set of inputs
pain of failure
User Perception of Reliability
1. A personal computer that crashes frequently v. a machine
that is out of service for two days.
2. A database system that crashes frequently but comes back
quickly with no loss of data v. a system that fails once in
three years but data has to be restored from backup.
3. A system that does not fail but has unpredictable periods
when it runs very slowly.
Reliability Metrics
Traditional Measures
• Mean time between failures
• Availability (up time)
• Mean time to repair
User Perception is Influenced by
• Distribution of failures
Hypothetical example: Cars are safer than airplanes in
accidents (failures) per hour, but less safe in failures per mile.
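
The traditional measures above are related: availability can be estimated from the mean time between failures (MTBF) and the mean time to repair (MTTR):

    Availability = MTBF / (MTBF + MTTR)

For example, a system with an MTBF of 1,000 hours and an MTTR of 2 hours is available 1,000/1,002 of the time, or about 99.8% up time.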
Reliability Metrics for Distributed Systems
Traditional metrics are hard to apply in multi-component
systems:
• In a big network, at any given moment something will be giving
trouble, but very few users will see it.
• A system that has excellent average reliability may give
terrible service to certain users.
• There are so many components that system administrators
rely on automatic reporting systems to identify problem areas.
Cost of Improved Reliability
[Graph: cost ($) plotted against up time, rising steeply as up time
approaches 100% from 99%]
Will you spend your money on new functionality
or improved reliability?
Example: Dartmouth Time Sharing (1978)
A central computer serves the entire campus. Any
failure is serious.
Step 1
Gather data on every failure
• 10 years of data in a simple database
• Every failure analyzed:
    hardware
    software (default)
    environment (e.g., power, air conditioning)
    human (e.g., operator error)
Example: Dartmouth Time Sharing (1978)
Step 2
Analyze the data
• Weekly, monthly, and annual statistics
Number of failures and interruptions
Mean time to repair
• Graphs of trends by component, e.g.,
Failure rates of disk drives
Hardware failures after power failures
Crashes caused by software bugs in each module
Example: Dartmouth Time Sharing (1978)
Step 3
Invest resources where the benefit will be maximum, e.g.,
• Orderly shutdown after power failure
• Priority order for software improvements
• Changed procedures for operators
• Replacement hardware
Terminology
Fault avoidance
Build systems with the objective of creating fault-free systems
Fault tolerance
Build systems that continue to operate when faults
occur
Fault detection (testing and validation)
Detect faults before the system is put into operation.
Fault Avoidance: Cleanroom Software Development
Software development process that aims to develop zero-defect
software.
• Formal specification
• Incremental development with customer input
• Constrained programming options
• Static verification
• Statistical testing
It is always better to prevent defects than to remove them later.
Example: The four color problem.
Fault Tolerance
General Approach:
• Failure detection
• Damage assessment
• Fault recovery
• Fault repair
N-version programming -- Execute independent
implementations in parallel, compare the results, and accept
the most probable.
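
A minimal sketch of the voting step, assuming three hypothetical, independently written implementations (compute1, compute2, compute3) of the same specification; all names are illustrative:

    #include <stdio.h>

    /* Three independently developed versions of the same function
       (hypothetical specification: square an integer). */
    int compute1(int x) { return x * x; }
    int compute2(int x) { int r = 0, i; for (i = 0; i < x; i++) r += x; return r; }
    int compute3(int x) { return x * x; }

    /* 2-out-of-3 majority vote: accept a result when at least two
       versions agree, otherwise report a detected failure. */
    int vote(int a, int b, int c, int *result) {
        if (a == b || a == c) { *result = a; return 0; }
        if (b == c)           { *result = b; return 0; }
        return -1;   /* no majority: treat as a failure */
    }

    int main(void) {
        int r;
        if (vote(compute1(7), compute2(7), compute3(7), &r) == 0)
            printf("accepted result: %d\n", r);
        else
            printf("versions disagree -- fail safe\n");
        return 0;
    }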
Fault Tolerance
Basic Techniques:
• After an error, continue with the next transaction
• Timers and timeouts in networked systems
• Error correcting codes in data
• Bad block tables on disk drives
• Forward and backward pointers

Report all errors for quality control
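
As one small illustration of error checking codes in data, a sketch of a simple additive checksum, stored with a block when it is written and verified when it is read (a real system would more likely use a CRC, which detects far more error patterns):

    #include <stddef.h>
    #include <stdint.h>

    /* Additive checksum over a buffer.  A mismatch between the
       stored and recomputed value signals corruption. */
    uint32_t checksum(const uint8_t *buf, size_t len) {
        uint32_t sum = 0;
        size_t i;
        for (i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }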
Fault Tolerance
Backward Recovery:
• Record the system state at specific events (checkpoints).
After a failure, recreate the state at the last checkpoint.
• Combine checkpoints with a system log that allows
transactions since the last checkpoint to be repeated
automatically.
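
A minimal sketch of checkpoint-plus-log recovery, assuming the whole system state fits in one struct and each logged transaction can be re-applied deterministically; the file names and the apply function are illustrative:

    #include <stdio.h>

    struct state { long balance; };   /* the entire system state */
    struct txn   { long amount;  };   /* one logged transaction  */

    /* Re-apply one transaction (must be deterministic). */
    void apply(struct state *s, const struct txn *t) { s->balance += t->amount; }

    /* Take a checkpoint: save the state, then empty the log. */
    void checkpoint(const struct state *s) {
        FILE *f = fopen("checkpoint.dat", "wb");
        if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
        f = fopen("txn.log", "wb");   /* truncate the log */
        if (f) fclose(f);
    }

    /* Recover: restore the last checkpoint, then replay the log. */
    void recover(struct state *s) {
        struct txn t;
        FILE *cp  = fopen("checkpoint.dat", "rb");
        FILE *log = fopen("txn.log", "rb");
        if (cp) {
            if (fread(s, sizeof *s, 1, cp) != 1)
                fprintf(stderr, "bad checkpoint\n");
            fclose(cp);
        }
        if (log) {
            while (fread(&t, sizeof t, 1, log) == 1)
                apply(s, &t);
            fclose(log);
        }
    }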
Requirements Specification of System Reliability

Example: ATM card reader

Failure class     Example                          Metric
Permanent,        System fails to operate with     1 per 1,000 days
non-corrupting    any card -- reboot
Transient,        System cannot read an            1 in 1,000 transactions
non-corrupting    undamaged card
Corrupting        A pattern of transactions        Never
                  corrupts the database
Software Engineering for Real Time
The special characteristics of real time computing require
extra attention to good software engineering principles:
• Requirements analysis and specification
• Development of tools
• Modular design
• Exhaustive testing
Heroic programming will fail!
Software Engineering for Real Time
Testing and debugging need special tools and environments:
• Debuggers, etc., cannot be used to test real time
performance
• Simulation of the environment may be needed to test interfaces
-- e.g., adjustable clock speed
• General purpose tools may not be available
Defensive Programming
Murphy's Law:
If anything can go wrong, it will.
Defensive Programming:
• Redundant code is incorporated to check the system state after
modifications
• Implicit assumptions are tested explicitly
Defensive Programming Examples
• Use a boolean variable, not an integer
• Test i <= n, not i == n
• Assertion checking
• Build debugging code into the program, with a switch to
display values at interfaces
• Error checking codes in data, e.g., checksum or hash

A sketch combining several of these techniques follows.
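
A minimal sketch, assuming a function that sums an array; the DEBUG switch and all names are illustrative:

    #include <assert.h>
    #include <stdio.h>

    #define DEBUG 1   /* switch to turn interface tracing on and off */

    /* Sum the first n entries of a[] (array capacity len).
       Defensive touches:
       - the implicit assumption 0 <= n <= len is tested explicitly;
       - the loop guard is an inequality (i < n), so a disturbed
         index cannot step past the bound the way i != n could. */
    long sum(const int a[], int n, int len) {
        long total = 0;
        int i;
        assert(n >= 0 && n <= len);   /* test the implicit assumption */
        for (i = 0; i < n; i++)       /* inequality, not equality */
            total += a[i];
    #if DEBUG
        fprintf(stderr, "sum: n=%d total=%ld\n", n, total);  /* values at interface */
    #endif
        return total;
    }

    int main(void) {
        int data[4] = { 1, 2, 3, 4 };
        printf("%ld\n", sum(data, 4, 4));
        return 0;
    }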
Error Avoidance
Risky programming constructs:
• Pointers
• Dynamic memory allocation
• Floating-point numbers
• Parallelism
• Recursion
• Interrupts
All are valuable in certain circumstances, but
should be used with discretion
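
As one example of why floating-point numbers appear on this list, a sketch of the classic pitfall of testing floats for exact equality:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 0.1 + 0.2;

        /* Risky: 0.1 and 0.2 have no exact binary representation,
           so on typical hardware this test is false. */
        if (x == 0.3)
            printf("exactly equal\n");

        /* Safer: compare within a tolerance. */
        if (fabs(x - 0.3) < 1e-9)
            printf("equal within tolerance\n");
        return 0;
    }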
Static and Dynamic Verification
Static verification: Techniques of verification that do not
include execution of the software.
• May be manual or use computer tools.
Dynamic verification
• Testing the software with trial data.
• Debugging to remove errors.
Static Validation & Verification
Carried out throughout the software development process.
[Diagram: validation & verification reviews applied at every stage
-- requirements specification, design, and program]
Static Verification: Program Inspections
Formal program reviews whose objective is to detect faults
• Code may be read or reviewed line by line
• 150 to 250 lines of code in a 2 hour meeting
• Use a checklist of common errors
• Requires team commitment, e.g., trained leaders

So effective that it is claimed that it can replace unit testing
Inspection Checklist: Common Errors
Data faults: Initialization, constants, array bounds, character
strings
Control faults: Conditions, loop termination, compound
statements, case statements
Input/output faults: All inputs used; all outputs assigned a
value
Interface faults: Parameter numbers, types, and order;
structures and shared memory
Storage management faults: Modification of links,
allocation and de-allocation of memory
Exceptions: Possible errors, error handlers
Static Analysis Tools
Program analyzers scan the source of a program for possible
faults and anomalies (e.g., Lint for C programs).
• Control flow: loops with multiple exit or entry points
• Data use: undeclared or uninitialized variables, unused
variables, multiple assignments, array bounds
• Interface faults: parameter mismatches, non-use of
function results, uncalled procedures
• Storage management: unassigned pointers, pointer
arithmetic
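
A small C fragment containing several of these anomalies, of the kind a Lint-style analyzer flags (the code compiles, but every marked line is suspect):

    #include <stdio.h>

    static int get_count(void) { return 42; }

    void report(void) {
        int count;       /* anomaly: never assigned before the use below */
        int unused;      /* anomaly: declared but never used */

        get_count();     /* anomaly: function result ignored */
        printf("count = %d\n", count);   /* reads the uninitialized value */
    }

    int main(void) { report(); return 0; }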
Static Analysis Tools (continued)
• Cross-reference table: shows every use of a variable,
procedure, object, etc.
• Information flow analysis: identifies the input variables on which
an output depends.
• Path analysis: identifies all possible paths through the
program.
Modern compilers contain many static analysis tools
Fixing Bugs
Isolate the bug
Intermittent --> repeatable
Complex example --> simple example
Understand the bug
Root cause
Dependencies
Structural interactions
Fix the bug
Design changes
Documentation changes
Code changes
Moving the Bugs Around
Fixing bugs is an error-prone process!
• When you fix a bug, fix its environment
• Bug fixes need static and dynamic testing
• Repeat all tests that have the slightest relevance
(regression testing)
Bugs have a habit of returning!
• When a bug is fixed, add the failure case to the test suite
for the future.
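
A minimal sketch of capturing a failure case as a permanent regression test; the function and the original failing input are illustrative:

    #include <assert.h>

    /* Function that once had a bug: very negative inputs were
       mishandled (illustrative). */
    static int clamp_percent(int x) {
        if (x < 0)   return 0;
        if (x > 100) return 100;
        return x;
    }

    int main(void) {
        /* The original failure case, kept in the suite forever so
           the bug cannot silently return. */
        assert(clamp_percent(-2147483647 - 1) == 0);

        /* Routine cases. */
        assert(clamp_percent(50)  == 50);
        assert(clamp_percent(101) == 100);
        return 0;
    }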
Some Notable Bugs
• Built-in function in Fortran compiler (e^0 = 0)
• Japanese microcode for Honeywell DPS virtual memory
• The microfilm plotter with the missing byte (1:1023)
• The Sun 3 page fault that IBM paid to fix
• Left-handed rotation in the graphics package
Good people work around problems.
The best people track them down and fix them!