TDDB47 Real-time Systems
Lecture 4: Scheduling (II) cont'd & Dependability
Simin Nadjm-Tehrani
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University
Undergraduate course on Real-time Systems, Linköping University, Autumn 2005, 36 pages

Outline
Part 1:
• Continuing with ICP and Deadlocks
• Towards more dynamic scheduling:
  – Offloading non-critical tasks to other processors (distributed scheduling)
Part 2:
• Dependable Systems

Reading material (part 1)
• Chapter 13 of Burns & Wellings, in particular 13.11.
• Background reading on deadlocks
• Article by Ramamritham, Stankovic, and Zhao, IEEE Transactions on Computers, Volume 38(8), August 1989. For the evaluation results, concentrate on section V.B.

ICP & Deadlock
• The ICP prevents deadlocks (How?)
• Moreover, it prevents starvation (How?)

Deadlock prevention
• Prevention technique
  – allocate all necessary resources at once, before execution
  – Drawbacks?
  – Recall the new problem created with the dining philosopher problem…

Starvation
Starvation/lockout happens if some process never gets hold of the resources it needs despite the fact that the resources are not constantly engaged

What is meant by liveness?
• Liveness, i.e. absence of deadlock, starvation and livelock, is necessary in real-time systems
• But not sufficient...
• If this can be guaranteed then the system is live: intuitively, the good things that should happen will happen sooner or later

Now back to scheduling...
• Immediate ceiling protocol (ICP) is deadlock preventing ...

Distributed Scheduling
• Relax the restriction on a-priori fixed task sets and arrival times
• Consider CPU and all other resources a task might need to complete
• Can a task be offloaded from the processor it arrives at?

Bidding
• Task T arrives at node Ni
• If Ni cannot guarantee T meeting its deadline, it will ask some nodes to bid for running T
[Figure: Ni asks for bids from the nodes with sufficient surplus according to Ni's knowledge; T goes to node Nk, which made the successful bid]

Focused addressing
• Task T arrives at node Ni
• If Ni cannot guarantee T, it looks for a node that has a surplus resource level above a fixed limit (Focused Addressing Surplus, FAS)
[Figure: Ni sends T to a node Nj with Surplus > FAS]

A combined approach
• None of the earlier methods guarantees successful scheduling
• Combine heuristics to increase chances
[Figure: T, a node with Surplus > FAS, and a successful bid]

Reading material (part 2)
• Chapter 5 of Burns & Wellings
• Dependable Systems: IFIP terminology as described in [Avizienis et al. 2004]

Dependability
• How can we produce systems that do their job, and how to measure how well they do their jobs?
• How do things go wrong and why?
• What can we do about it?
  – This lecture: Basic overview of fault-tolerant systems
  – Next lecture: Designing Dependable Real-time systems

Early computer systems
• 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system 1951
• Short life of vacuum tubes gave mean time to failure of 20 minutes

Early space and avionics
• During 1955, 18 air carrier accidents in the USA (when only 20% of the public was willing to fly!)
• 1970: Apollo 13 had less computing power on board than a PC produced ten years later

Still, accidents! [rest of the slide annotation is illegible in the handout]

June 1985 – January 1987
• Six patients in USA and Canada got very high doses of radiation and severe burns from the cancer-treatment system Therac 25.
• Doses as high as 15,000-20,000 radiation units compared with the normal levels (~ 200 units) had been given. Three died.

3rd February 1994
• TCAS is a system designed to avoid mid-air collisions between passenger planes.
• Two commercial aircraft came as close as 1.6 km to each other while flying over Oregon in USA.

• "Friendly Fire": during the Gulf war, 24% of the American soldiers killed (35 of 146) were killed by their own systems.

What is dependability?
Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]

Attributes of dependability [Sv. Pålitlighet]
IFIP WG 10.4 definitions:
• Safety: non-occurrence of catastrophic consequences on the environment
• Availability: the readiness for usage
• Integrity: non-occurrence of unauthorized alteration of information
• Reliability: continuity of correct service

Reliability [Sv. Tillförlitlighet]
Means that the system (functionally) behaves as specified, and does it continually over measured intervals of time.
Typical measure in aerospace: 10^-9, i.e. one failure in 10^9 flight hours.

Faults, Errors & Failures
• Fault: a defect within the system or a situation that can lead to failure
• Error: manifestation (symptom) of the fault, an unexpected behaviour
• Failure: system not performing its intended function

Examples
• Year 2000 bug
• Bit flips in hardware due to cosmic radiation in space
• Loose wire
• Aircraft retracting its landing gear while on ground
Effects in time: Permanent / transient / intermittent

Fault ⇒ Error ⇒ Failure
• Goal of system verification and validation is to eliminate faults
  Some will remain…
• Goal of safety/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear: eliminate or delay failures

More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance
2. Fault removal
3. Fault tolerance
4. Fault forecasting

Fault tolerance
• Means that a system provides a degraded (but acceptable) function
  – Even in presence of faults
  – During a period defined by certain model assumptions
• Foreseen or unforeseen?

External factors
The film…

Types of failures
• Node failures
  – Crash
  – Omission
  – Byzantine
• Channel failures
  – Crash (and potential partitions)
  – Message loss
  – Erroneous/arbitrary messages

On-line fault-management
• Fault detection
  – By program or its environment
• Fault tolerance (containment) using redundancy
  – software
  – hardware
  – data

Redundancy
From D. Lardner: Edinburgh Review, year 1824:
"The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods."
* people who compute

Static Redundancy
Used in all cases (whether an error has appeared or not), just in case…
  – SW: N-version programming
  – HW: Voting systems
  – Data: parity bits, checksums

Dynamic Redundancy
Used when error appears and has to be treated
  – SW: Recovery methods
  – HW: Switching to back-up module
  – Data: Self-correcting codes
  – Time: Re-computing a result

Questions?
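
Illustration (not part of the original slides): the static redundancy slide lists voting systems and parity bits. The Python sketch below shows one possible reading of both ideas. The function names majority_vote, even_parity_bit and parity_ok are hypothetical, and the strict-majority rule is an assumption; real voters may use other rules such as median or tolerance-based comparison.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas.

    Returns None when there is no strict majority (assumed policy:
    the voter reports "no agreement" instead of guessing).
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def even_parity_bit(bits: Sequence[int]) -> int:
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits: Sequence[int], parity: int) -> bool:
    """Check a word against its stored even-parity bit."""
    return (sum(bits) + parity) % 2 == 0


if __name__ == "__main__":
    # Three replicas, one of them faulty: the fault is masked.
    print(majority_vote([42, 42, 41]))
    # All replicas disagree: no value can be trusted.
    print(majority_vote([1, 2, 3]))

    word = [1, 0, 1, 1]
    p = even_parity_bit(word)
    corrupted = [1, 0, 0, 1]  # a single bit flip
    print(parity_ok(word, p), parity_ok(corrupted, p))
```

With three replicas the voter masks one faulty output without any detection step, which matches the "used in all cases, just in case" character of static redundancy. A single parity bit only detects an odd number of bit flips; it cannot correct them, which is where the self-correcting codes listed under dynamic redundancy come in.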