Fault-Tolerant Computer Systems

HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 1
INTRODUCTION
Winter Semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc
DS - IX - NFT - 1
FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
1. Introduction (Unit I)
– Motivation
– System views
– Dependability rings
– Dependable design methodology
2. Dependability Concepts, Measures and Models (UNIT DCMM)
– Basic definitions
– Dependability measures
– Dependability models
– Examples
– Dependability evaluation tools
3. Testing Techniques (UNIT TT)
– Testing techniques principles
– Processor testing
– Memory testing
– Network testing
4. Fault Diagnosis Techniques (UNIT FST)
– Fault detection techniques
– Fault location (isolation) methods
5. Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
– Dynamic techniques
– Static techniques
– Hybrid techniques
6. Fault-tolerant and Fault-secure Memories (UNIT FRTT)
– Fault-tolerant techniques in manufacturing
– Replication
– Coding
– Reconfiguration
7. Network Fault Tolerance (UNIT NFT)
– Computer networks
– Basic techniques
– Example – multistage networks
8. Case Studies (UNIT CS)
– ESS and 3B20
– FTMP – Fault-tolerant Multiprocessor
– SIFT – Software-implemented Fault Tolerance
– Communication controller
– Fault-tolerant Building Block Architecture
COURSE ACTIVITIES
• PROJECT
• PRESENTATION
• INVITED SPEAKERS
• CONFERENCES AND WORKSHOPS
• Some Websites:
– www.dependability.org
– www.paradise.caltech.edu
– www.milan.eas.asu.edu
– www.crhc.uiuc.edu
Major References on Fault-tolerant Computing
(Books/General)
• Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
• Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
• Breuer, M. A. and A. D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
• Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
• Anderson, T. and P. A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.
• Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
• Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
• Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.
• Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
• Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
• Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
• Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
• Landwehr, C. E., B. Randell and L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8: Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
• Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
• Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
• Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9: Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
• Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
• Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
• Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
• Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, Muenchen, 1999.
Major References on Fault-tolerant
Computing (Books/Reliability Evaluation)
• Myers, G. J., Software Reliability: Principles and Practices, Wiley-Interscience, 1976.
• Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
• Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
• Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
• Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.
Major References on Fault-tolerant
Computing (Books/Coding)
• Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
• Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
• Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
• Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
• Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
• Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.
Major References on Fault-tolerant
Computing (Books/Software)
• Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
• Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
• Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
• Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
• Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
• Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
• Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
• Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.
Major References on Fault-tolerant
Computing (Journals)
• Special Issue of Proc. of the IEEE, October 1978
• Special Issue of Computer, October 1979
• Special Issue of Computer, March 1980
• Special Issue of Computer, August 1984
• Special Issue of IEEE Software, May 1995
• IEEE Trans. on Reliability
• IEEE Trans. on Software Engineering
• Computer
• Design and Test
• Electronics
• Proc. of the IEEE
• Computer Design
• Journal of Electronic Testing: Theory and Applications
• Journal of Parallel and Distributed Computing
• IEEE Trans. on Parallel and Distributed Systems
• Real-Time Systems Journal
Major References on Fault-tolerant
Computing (Conference Proceedings)
• Fault-Tolerant Computing Symposium
• Reliability and Maintainability Symposium
• Reliability in Distributed Software and Database Systems Symposium
• Test Conference
• Distributed Computing Systems Conference
• Parallel Processing Conference
• Real-Time Systems Symposium
• Computer Architecture Symposium
INTRODUCTION
• OBJECTIVES:
– MOTIVATION FOR FAULT-TOLERANT SYSTEMS
– TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND
THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
– TO PRESENT BASIC CONCEPTS AND APPROACHES
– TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
• CONTENTS:
– MOTIVATION
– SYSTEM VIEWS
– SYSTEM DEPENDABILITY CONCEPTS
– APPROACHES TO DEPENDABLE DESIGN
– DEPENDABILITY RINGS
– DEPENDABLE DESIGN METHODOLOGY
TYPES OF SYSTEMS
• Dependable (Reliable) System
– A system that delivers the required service during its lifetime
• Fault-Tolerant Computer System
– A system that can continue the correct execution of its programs and
input/output functions in the presence of faults
• Real-Time Computer System
– A system that delivers service to a user within a specified
deadline (physical time, duration, etc.)
• Responsive Computer System
– A fault-tolerant real-time system that delivers satisfactory
service in a timely manner
MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
• ECONOMIC NECESSITY
• LIFE SAVING
• NOVICE USERS
• HARSH ENVIRONMENTS
• MORE COMPLEX SYSTEMS
DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: Mean Time Between Failures (MTBF) in years (log scale, 1 to 10^6) versus year (1950 to 1990). Labels: Equivalent Device Reliability; System Reliability; Minimum Acceptable Reliability. Technology progression: Relays – Vacuum Tubes – Semiconductors – SSI – MSI – LSI – VLSI]
DEPENDABILITY – PERFORMANCE TRADE-OFF
[Figure: Availability (0.9 to 0.99999) versus Throughput (1 to 100000 MIPS). Regions, from highest to lowest availability: Ultra Reliable Systems; Commercial Fault-Tolerant Systems; Massively Parallel/Distributed Systems]
EXAMPLES
• DEFENSE SYSTEMS
• FLIGHT SYSTEMS
• AIR TRAFFIC CONTROL
• COMMUNICATION SYSTEMS
• BANKING SYSTEMS
• AIRLINE SEAT RESERVATIONS
• TELEPHONE SYSTEMS
• HOUSEHOLD APPLIANCES
• VIDEO GAMES
VIEW 1: SYSTEM LIFE CYCLE
[Figure: system life cycle, driven by SYSTEM CONSTRAINTS, OBSOLESCENCE, NEEDS and NEW TECHNOLOGY, with the following phases in sequence]
1. CONCEPT FORMULATION
2. SYSTEM SPECIFICATION
3. DESIGN
4. PROTOTYPE
5. PRODUCTION
6. INSTALLATION
7. OPERATIONAL LIFE
8. MODIFICATION AND RETIREMENT
• Notice that testing, verification or validation should occur after every phase of the life cycle
• Very few tools exist, and those cover only some steps of the cycle
VIEW 2: PACKAGING LEVELS OF INTEGRATION
• APPLICATIONS
• APPLICATIONS MODULES
• SPECIAL-PURPOSE LANGUAGES
• STANDARD LANGUAGES
• OPERATING SYSTEMS
• CABINETS/FRAMES
• BOXES/CAGES
• PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
• INTEGRATED CIRCUITS (CHIPS)
• Dependability must be considered at every level
• System decomposition (partitioning) may have a significant
impact on dependability
VIEW 3: WORKLOAD VIEW
[Figure: workload view of LIVEWARE and HARDWARE/SOFTWARE, divided into PREPARATION, USEFUL WORK, SEMI-USEFUL WORK, IDLING and FAULT SERVICING]
• ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY
VIEW 4: LEVELS OF ABSTRACTION FOR
DIGITAL COMPUTERS
LEVEL | SUBLEVEL | COMPONENTS
PMS | | Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
Program | HLL, ISP (Instruction Set Processor) | Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
Logic | Register Transfer Level (RTL) | Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
Circuit | | Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors
Quantum & Electromagnetic | | Disks, Tapes
• DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL
VIEW 5: COMPUTER SYSTEM
SOFTWARE: PACKAGES, ASSEMBLERS, COMPILERS, OPERATING SYSTEMS, UTILITY PROGRAMS, DEBUGGING PROGRAMS, FILE PROCESSING PROGRAMS
FIRMWARE: MICROPROGRAM & MICROPROGRAMMING SYSTEMS
HARDWARE: CPUs, I/O DEVICES, MEMORIES, INTERCONNECTION NETWORKS
LIVEWARE: MAINTENANCE PERSONNEL, OPERATORS, SYSTEM DESIGNERS, SYSTEM ANALYSTS, PROGRAMMERS, USERS
FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%;
PEOPLE: 15%-40%; AT&T’s: 20-40-40%; (2/3 applications + 1/3 OS)
(WARNING!!!)
VIEW 6: IF YOU DO NOT FOLLOW
DEPENDABLE DESIGN METHODOLOGY
YOU MAY END UP WITH THE FOLLOWING:
SIX PHASES OF A PROJECT
1. ENTHUSIASM
2. DISILLUSIONMENT
3. PANIC AND HYSTERIA
4. SEARCH FOR THE GUILTY
5. PUNISHMENT OF THE INNOCENT
6. PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
(Author unknown – found in one of the computer companies)
SYSTEM DEPENDABILITY CONCEPTS
• RELIABILITY
– The conditional probability that the system performs its intended function
without failure up to time t, provided it was fully operational at time t = 0
• AVAILABILITY
– Instantaneous availability A(t) is the probability that a system is performing
correctly at time t; for non-repairable systems it equals reliability:
A(t) = R(t)
– Steady-state availability is the probability that a system will be operational
at any random point of time, expressed as the fraction of time the
system is operational during its expected lifetime:
As = UPTIME / LIFETIME
• SURVIVABILITY
– The probability that a system will deliver the required service in the
presence of an a priori defined set of faults or any subset of it
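The measures above can be tried out numerically. The following is a minimal Python sketch, not part of the lecture material: the function names are hypothetical, and the exponential failure law R(t) = exp(-t/MTBF) is an added assumption (constant failure rate) that the slide does not state. Steady-state availability uses the UPTIME/LIFETIME ratio defined above.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) under the ASSUMED exponential failure law (constant rate 1/MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the lifetime during which the system is operational."""
    if lifetime <= 0 or not 0 <= uptime <= lifetime:
        raise ValueError("need lifetime > 0 and 0 <= uptime <= lifetime")
    return uptime / lifetime


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by a steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(reliability(1000, 10000))            # survival probability at t = 1000 h
    print(steady_state_availability(9, 10))    # up 9 of 10 time units
    print(downtime_hours_per_year(0.999))      # "three nines" in hours per year
```

Under these assumptions, each additional "nine" of availability cuts the expected yearly downtime by a factor of ten.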
APPROACHES
• FAULT INTOLERANCE
• FAULT TOLERANCE
• MAINTAINABILITY
• HARDWARE/SOFTWARE TRADE-OFFS
HARDWARE/SOFTWARE CONTINUUM AND
VERTICAL MIGRATION
[Figure: hardware/software continuum between HARDWARE and SOFTWARE, pairing instruction classes with example machines]
INSTRUCTIONS | EXAMPLES
INTEGER ARITHMETIC ADD/SUB | M6800
MPY/DIV | MC68000
FLOATING-POINT ARITHMETIC | VAX-11/780, IBM-30XX
VECTOR PROCESSING | CRAY-XMP, C-205
MULTIPROCESSING (e.g., submachine set-up) | SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS
VERTICAL MIGRATION is the transfer of a function's implementation from software to
firmware and/or hardware, or vice versa.
Vertical migration can improve performance and dependability and reduce cost.
DEPENDABILITY (RELIABILITY) RINGS FOR
FAULT TOLERANCE
[Figure: concentric dependability rings with an acceptance test at each ring boundary. Rings: Operating System, Languages and Application; System Hardware; Register-Transfer Level; Logic Level]
Each Dependability Ring should provide measures and mechanisms for Fault
Tolerance (Detection, Location, Testability and Recovery)
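One way to picture rings with acceptance tests is a chain of checks in which the first rejecting ring also gives a coarse fault location. The Python sketch below is purely illustrative: the `Ring` class, the function name, the innermost-first ordering and all four test predicates are hypothetical placeholders, not mechanisms described in the lecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ring:
    name: str
    acceptance_test: Callable[[int], bool]  # True if the value is acceptable


def first_rejecting_ring(rings: List[Ring], value: int) -> Optional[str]:
    """Pass a value through the rings in order (assumed innermost first).

    Returns the name of the first ring whose acceptance test rejects the
    value (detection plus a coarse location), or None if all rings accept.
    """
    for ring in rings:
        if not ring.acceptance_test(value):
            return ring.name
    return None


# Placeholder acceptance tests; real tests would be ring-specific checks.
RINGS = [
    Ring("Logic Level", lambda v: isinstance(v, int)),
    Ring("Register-Transfer Level", lambda v: 0 <= v < 2**16),
    Ring("System Hardware", lambda v: v % 2 == 0),
    Ring("Operating System, Languages and Application", lambda v: v != 0),
]

if __name__ == "__main__":
    for sample in (42, 70000, 7, 0):
        print(sample, "->", first_rejecting_ring(RINGS, sample))
```

A recovery action would be attached to each ring in the same way; it is left out here to keep the sketch short.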
A BOOTSTRAP – TEST RINGS IN A
MULTICOMPUTER SYSTEM
[Figure: test rings in a multicomputer system. Labels: Network; Memories; Processor; Diagnostic and Maintenance Processor(s) (Hardcore); Test Rings]
DEPENDABLE DESIGN METHODOLOGY
• Identify fault classes, fault latency and fault impact
• Determine qualitative and quantitative specs for fault tolerance
and evaluate your design in specific environment
• Identify “weak spots” and assess potential damage
• Decompose the system
• Develop fault and error detection techniques and algorithms
• Develop fault isolation techniques and algorithms
• Develop recovery/reintegration/restart
• Evaluate degree of fault tolerance
• Refine, iterate for improvement; try to eliminate “weak spots”
and minimize potential damage
REAL-TIME SYSTEMS DESIGN
• Identify time-critical tasks and specify their timing (deadlines,
durations, frequency, periodicity, if any). Characterize the
system load and environment.
• Characterize timing of a system (hardware and software).
• Map timing specification onto the system timing (find the best
resource allocation and scheduling methods), and incorporate
concurrent monitoring.
• Verify and validate the design for quantitative and qualitative
specifications.
• Refine, iterate and fine-tune the design.
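The first steps of this procedure can be sketched as a simple feasibility check. The Python code below is an illustration under stated assumptions, not the lecture's method: tasks are assumed periodic and to share one processor, the `Task` fields and function names are hypothetical, and the two checks are only necessary conditions (a task set that passes may still be unschedulable under a particular scheduling method).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    duration: float  # worst-case execution time
    deadline: float  # relative deadline, same time unit
    period: float    # time between releases, same time unit


def infeasible_tasks(tasks: List[Task]) -> List[str]:
    """Tasks that miss their deadline even when running alone."""
    return [t.name for t in tasks if t.duration > t.deadline]


def utilization(tasks: List[Task]) -> float:
    """Fraction of processor time demanded by all tasks together."""
    return sum(t.duration / t.period for t in tasks)


def passes_necessary_checks(tasks: List[Task]) -> bool:
    """Necessary (not sufficient) conditions for meeting all deadlines."""
    return not infeasible_tasks(tasks) and utilization(tasks) <= 1.0


if __name__ == "__main__":
    workload = [Task("sensor", 1, 4, 4), Task("control", 2, 8, 8)]
    print(utilization(workload), passes_necessary_checks(workload))
```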
RESPONSIVE SYSTEM DESIGN
• Determine qualitative and quantitative specifications for fault tolerance
and task timeliness which meet user requirements.
• Determine system timing (hardware and software); assess damage,
availability and responsiveness.
• Develop and time fault and error detection techniques and algorithms.
• Develop and time fault isolation techniques and algorithms.
• Develop and time recovery/reintegration/restart.
• Map timing specification onto system timing under appropriate
assumptions and incorporate concurrent monitoring.
• Evaluate responsiveness.
• Refine and iterate for improvement.
RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND
ARCHITECTS OF TIME
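Timing the fault-handling steps makes it possible to check whether a task still meets its deadline when a fault occurs. The Python sketch below is a simplified illustration with hypothetical names; it assumes a single fault per task execution and that detection, isolation and recovery times simply add to the task duration (no lost work to repeat).

```python
from dataclasses import dataclass


@dataclass
class FaultHandlingTimes:
    detection: float
    isolation: float
    recovery: float  # recovery/reintegration/restart

    def total(self) -> float:
        return self.detection + self.isolation + self.recovery


def slack_with_one_fault(task_duration: float, deadline: float,
                         times: FaultHandlingTimes) -> float:
    """Time left before the deadline if one fault is handled during the task."""
    return deadline - (task_duration + times.total())


def meets_deadline_with_one_fault(task_duration: float, deadline: float,
                                  times: FaultHandlingTimes) -> bool:
    """True if the task is still timely despite one handled fault."""
    return slack_with_one_fault(task_duration, deadline, times) >= 0


if __name__ == "__main__":
    times = FaultHandlingTimes(detection=2.0, isolation=1.0, recovery=4.0)
    print(slack_with_one_fault(10.0, 20.0, times))
    print(meets_deadline_with_one_fault(15.0, 20.0, times))
```

A negative slack indicates that either the fault-handling mechanisms must be made faster or the timing specification must be relaxed.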
REFERENCES
(TEXTBOOK)
• C. G. Bell, J. C. Mudge and J. E. McNamara “Seven Views of
Computer Systems”, Chapter 1 in the book by the same authors
titled “Computer Engineering”, Digital Press, 1978.
• G.J. Lipovski and M. Malek, “Parallel Computing: Theory and
Comparisons”, Wiley-Interscience, New York, 1987.
• M. Malek, “Parallel Computer Systems Testing and Integration”,
in the book titled “Testing and Diagnosis of VLSI and LSI”, M. G.
Sami and F. Lombardi (eds.), Kluwer, 1988.
• P. Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
• D. K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.