HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 1: INTRODUCTION
Winter semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc
DS - IX - NFT - 1

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
1. Introduction (Unit I)
   – Motivation
   – System views
   – Dependability rings
   – Dependable design methodology
2. Dependability Concepts, Measures and Models (UNIT DCMM)
   – Basic definitions
   – Dependability measures
   – Dependability models
   – Examples
   – Dependability evaluation tools
3. Testing Techniques (UNIT TT)
   – Testing techniques principles
   – Processor testing
   – Memory testing
   – Network testing
4. Fault Diagnosis Techniques (UNIT FST)
   – Fault detection techniques
   – Fault location (isolation) methods
5. Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
   – Dynamic techniques
   – Static techniques
   – Hybrid techniques
6. Fault-tolerant and Fault-secure Memories (UNIT FRTT)
   – Fault-tolerant techniques in manufacturing
   – Replication
   – Coding
   – Reconfiguration
7. Network Fault Tolerance (UNIT NFT)
   – Computer networks
   – Basic techniques
   – Example: multistage networks
8. Case Studies (UNIT CS)
   – ESS and 3B20
   – FTMP: Fault-tolerant Multiprocessor
   – SIFT: Software-implemented Fault Tolerance
   – Communication controller
   – Fault-tolerant Building Block Architecture

COURSE ACTIVITIES
• PROJECT
• PRESENTATION
• INVITED SPEAKERS
• CONFERENCES AND WORKSHOPS
• Some Websites:
   – www.dependability.org
   – www.paradise.caltech.edu
   – www.milan.eas.asu.edu
   – www.crhc.uiuc.edu

Major References on Fault-tolerant Computing (Books/General)
• Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
• Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
• Breuer, M. A. and A. D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
• Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
• Anderson, T. and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice-Hall, 1982.
• Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
• Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
• Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.
• Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
• Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
• Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
• Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
• Landwehr, C. E., B. Randell and L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8: Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
• Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
• Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
• Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9: Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
• Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
• Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
• Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
• Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.

Major References on Fault-tolerant Computing (Books/Reliability Evaluation)
• Myers, G. J., Software Reliability: Principles and Practice, Wiley-Interscience, 1976.
• Trivedi, K. S., Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, 1982.
• Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
• Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
• Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.

Major References on Fault-tolerant Computing (Books/Coding)
• Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
• Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
• Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
• Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
• Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
• Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.

Major References on Fault-tolerant Computing (Books/Software)
• Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
• Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
• Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
• Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
• Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
• Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
• Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
• Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

Major References on Fault-tolerant Computing (Journals)
• Special Issue of Proc. of IEEE, October 1978
• Special Issue of Computer, October 1979
• Special Issue of Computer, March 1980
• Special Issue of Computer, August 1984
• Special Issue of IEEE Software, May 1995
• IEEE Trans. on Reliability
• IEEE Trans. on Software Engineering
• Computer
• Design and Test
• Electronics
• Proc. of IEEE
• Computer Design
• Journal of Electronic Testing: Theory and Applications
• Journal of Parallel and Distributed Computing
• IEEE Trans. on Parallel and Distributed Systems
• Real-Time Systems Journal

Major References on Fault-tolerant Computing (Conference Proceedings)
• Fault-Tolerant Computing Symposium
• Reliability and Maintainability Symposium
• Reliability in Distributed Software and Database Systems Symposium
• Test Conference
• Distributed Computing Systems Conference
• Parallel Processing Conference
• Real-Time Systems Symposium
• Computer Architecture Symposium

INTRODUCTION
• OBJECTIVES:
   – MOTIVATION FOR FAULT-TOLERANT SYSTEMS
   – TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
   – TO PRESENT BASIC CONCEPTS AND APPROACHES
   – TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
• CONTENTS:
   – MOTIVATION
   – SYSTEM VIEWS
   – SYSTEM DEPENDABILITY CONCEPTS
   – APPROACHES TO DEPENDABLE DESIGN
   – DEPENDABILITY RINGS
   – DEPENDABLE DESIGN METHODOLOGY

TYPES OF SYSTEMS
• Dependable (Reliable) System
   – A system that delivers a required service during its lifetime
• Fault-Tolerant Computer System
   – A system that can continue the correct execution of its programs and input/output functions in the presence of faults
• Real-Time Computer System
   – A system that delivers service to a user within a specified deadline (physical time, duration, etc.)
• Responsive Computer System
   – A fault-tolerant real-time system that delivers satisfactory service in a timely manner

MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
• ECONOMIC NECESSITY
• LIFE SAVING
• NOVICE USERS
• HARSH ENVIRONMENTS
• MORE COMPLEX SYSTEMS

DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: Mean Time between Failures (MTBF) in years, on a logarithmic scale from 1 to 10^6, plotted over the years 1950 to 1990 across the technologies Relays, Vacuum Tubes, Semiconductors, SSI, MSI, LSI and VLSI. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability.]

DEPENDABILITY – PERFORMANCE TRADE-OFF
[Figure: Availability (0.9 to 0.99999) versus Throughput (1 to 100000 MIPS). Labeled regions: Ultra Reliable Systems, Commercial Fault-Tolerant Systems, Massively Parallel/Distributed Systems.]

EXAMPLES
• DEFENSE SYSTEMS
• FLIGHT SYSTEMS
• AIR TRAFFIC CONTROL
• COMMUNICATION SYSTEMS
• BANKING SYSTEMS
• AIRLINE SEAT RESERVATIONS
• TELEPHONE SYSTEMS
• HOUSEHOLD APPLIANCES
• VIDEO GAMES

VIEW 1: SYSTEM LIFE CYCLE
[Diagram labels: SYSTEM CONSTRAINTS, OBSOLESCENCE, NEEDS, NEW TECHNOLOGY. Phases: CONCEPT FORMULATION, SYSTEM SPECIFICATION, DESIGN, PROTOTYPE, PRODUCTION, INSTALLATION, OPERATIONAL LIFE, MODIFICATION AND RETIREMENT.]
• Notice that testing, verification or validation should occur after every phase of the life cycle
• Very few tools exist, and only for some steps of the cycle

VIEW 2: PACKAGING LEVELS OF INTEGRATION
• APPLICATIONS
• APPLICATIONS MODULES
• SPECIAL-PURPOSE LANGUAGES
• STANDARD LANGUAGES
• OPERATING SYSTEMS
• CABINETS/FRAMES
• BOXES/CAGES
• PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
• INTEGRATED CIRCUITS (CHIPS)
• Dependability must be considered at every level
• System decomposition (partitioning) may have a significant impact on dependability

VIEW 3: WORKLOAD VIEW
[Diagram labels: LIVEWARE, HARDWARE/SOFTWARE; PREPARATION, USEFUL WORK, SEMI-USEFUL WORK, IDLING, FAULT SERVICING.]
• ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY

VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
LEVEL (SUBLEVEL): COMPONENTS
   – PMS: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
   – Program (HLL; ISP, Instruction Set Processor): Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
   – Logic (Register Transfer Level, RTL): Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
   – Circuit: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors
   – Quantum & Electromagnetic: Disks, Tapes
• DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL

VIEW 5: COMPUTER SYSTEM
• SOFTWARE: PACKAGES, ASSEMBLERS, COMPILERS, OPERATING SYSTEMS, UTILITY PROGRAMS, DEBUGGING PROGRAMS, FILE PROCESSING PROGRAMS
• LIVEWARE: MAINTENANCE PERSONNEL, OPERATORS, SYSTEM DESIGNERS, SYSTEM ANALYSTS, PROGRAMMERS, USERS
• FIRMWARE: MICROPROGRAM & MICROPROGRAMMING SYSTEMS
• HARDWARE: CPUs, I/O DEVICES, MEMORIES, INTERCONNECTION NETWORKS
FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%; AT&T's: 20%-40%-40% (2/3 applications + 1/3 OS)

(WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY YOU MAY END UP WITH THE FOLLOWING:
SIX PHASES OF A PROJECT
1. ENTHUSIASM
2. DISILLUSIONMENT
3. PANIC AND HYSTERIA
4. SEARCH FOR THE GUILTY
5. PUNISHMENT OF THE INNOCENT
6. PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
(Author unknown; found in one of the computer companies)

SYSTEM DEPENDABILITY CONCEPTS
• RELIABILITY
   – The conditional probability that the system performs its intended function without failure up to time t, provided it was fully operational at time t = 0
• AVAILABILITY
   – Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability: $A(t) = R(t)$
   – Steady-state availability is the probability that a system will be operational at any random point of time, expressed as the fraction of time a system is operational during its expected lifetime: $A_s(t) = \frac{\mathrm{UPTIME}}{\mathrm{LIFETIME}}$
• SURVIVABILITY
   – The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it

APPROACHES
• FAULT INTOLERANCE
• FAULT TOLERANCE
• MAINTAINABILITY
• HARDWARE/SOFTWARE TRADE-OFFS

HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
[Figure: a continuum from HARDWARE to SOFTWARE. INSTRUCTIONS: INTEGER ARITHMETIC (ADD/SUB, MPY/DIV), FLOATING-POINT ARITHMETIC, VECTOR PROCESSING, MULTIPROCESSING (e.g., submachine set-up). EXAMPLES: M6800, MC68000, VAX-11/780, IBM-30XX, CRAY-XMP, C-205, SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS.]
VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa. Vertical migration improves performance and dependability, and reduces cost.
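
The steady-state availability definition under SYSTEM DEPENDABILITY CONCEPTS (uptime divided by lifetime) can be made concrete with a small calculation. The following is a minimal sketch, not part of the original slides: it converts the availability levels on the axis of the dependability/performance trade-off figure (0.9 to 0.99999) into expected downtime per year. The function names and the 8760-hour year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8760 hours


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the lifetime during which the system is operational."""
    if lifetime <= 0 or not 0 <= uptime <= lifetime:
        raise ValueError("require 0 <= uptime <= lifetime and lifetime > 0")
    return uptime / lifetime


def annual_downtime_hours(availability: float) -> float:
    """Expected hours per year during which the system is not operational."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Availability levels taken from the axis of the trade-off figure.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        hours = annual_downtime_hours(a)
        print(f"A_s = {a:<8} downtime = {hours:9.3f} h/year "
              f"({hours * 60:9.1f} min/year)")
```

Under these assumptions, an availability of 0.9 corresponds to about 876 hours of downtime per year, while 0.99999 corresponds to roughly 5.3 minutes per year, which suggests why each additional "nine" is costly to achieve.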

DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE
[Figure: Dependability Rings, each closed by an Acceptance Test. Rings: Operating System, Languages and Application; System Hardware; Register-Transfer Level; Logic Level.]
Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery)

A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM
[Figure: Test Rings. Labels: Network, Memories, Processor, Diagnostic and Maintenance Processor(s) (Hardcore).]

DEPENDABLE DESIGN METHODOLOGY
• Identify fault classes, fault latency and fault impact
• Determine qualitative and quantitative specs for fault tolerance and evaluate your design in a specific environment
• Identify "weak spots" and assess potential damage
• Decompose the system
• Develop fault and error detection techniques and algorithms
• Develop fault isolation techniques and algorithms
• Develop recovery/reintegration/restart
• Evaluate degree of fault tolerance
• Refine, iterate for improvement; try to eliminate "weak spots" and minimize potential damage

REAL-TIME SYSTEMS DESIGN
• Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
• Characterize the timing of the system (hardware and software).
• Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
• Verify and validate the design for quantitative and qualitative specifications.
• Refine, iterate and fine-tune the design.

RESPONSIVE SYSTEM DESIGN
• Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
• Determine system timing (hardware and software), assess damage, availability and responsiveness.
• Develop and time fault and error detection techniques and algorithms.
• Develop and time fault isolation techniques and algorithms.
• Develop and time recovery/reintegration/restart.
• Map the timing specification onto the system timing under appropriate assumptions and incorporate concurrent monitoring.
• Evaluate responsiveness.
• Refine and iterate for improvement.
RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME

REFERENCES (TEXTBOOK)
• C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in the book by the same authors titled "Computer Engineering", Digital Press, 1978.
• G. J. Lipovski and M. Malek, "Parallel Computing: Theory and Comparisons", Wiley-Interscience, New York, 1987.
• M. Malek, "Parallel Computer Systems Testing and Integration", in the book titled "Testing and Diagnosis of VLSI and LSI", M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
• P. Jalote, "Fault Tolerance in Distributed Systems", Prentice-Hall, 1994.
• D. K. Pradhan, "Fault-Tolerant Computer System Design", Prentice-Hall, 1996.