Presentation of Licentiate Thesis
Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov
Embedded Systems Lab (ESLAB), Linköping University, Sweden

Motivation
- Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
- Focus on transient faults and intermittent faults

Motivation: Transient Faults
- Occur for a short time
- Corrupt data or cause miscalculations in logic
- Do not cause permanent damage to circuits
- Causes lie outside the system boundaries: electromagnetic interference (EMI), radiation, lightning storms

Motivation: Intermittent Faults
- Manifest similarly to transient faults, but happen repeatedly
- Causes lie inside the system boundaries: crosstalk, internal EMI, power supply fluctuations, software errors (Heisenbugs)

Motivation
- Transient faults are becoming more likely as transistor sizes shrink and operating frequencies grow
- Errors caused by transient faults have to be tolerated before they crash the system
- However, fault tolerance against transient faults leads to significant performance overhead
- Hence the need for design optimization of embedded systems with fault tolerance

Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work

General Design Flow
System Specification → Architecture Selection → Mapping & Hardware/Software Partitioning → Fault Tolerance Techniques → Scheduling → Back-end Synthesis, with feedback loops between the stages

Fault Tolerance Techniques
- Re-execution: a failed process is restarted on the same node; adds an error-detection overhead and a recovery overhead
- Rollback recovery with checkpointing: execution rolls back to the latest checkpoint rather than the start of the process; adds a checkpointing overhead
- Active replication: replicas run on different nodes, e.g. P1(1) on N1 and P1(2) on N2
[Figure: Gantt charts of re-execution (P1/1, P1/2 on N1), rollback recovery with checkpointing, and active replication]

Limitations of Previous Work
- Design optimization with fault tolerance is limited
- Process mapping is not considered together with fault tolerance issues
- Multiple faults are not addressed in the framework of static cyclic scheduling
- Transparency, if addressed at all, is restricted to a whole computation node

Outline (repeated; the contributions begin with scheduling under fault tolerance requirements)

Fault-Tolerant Time-Triggered Systems
- Transient faults
- Processes: re-execution, active replication, rollback recovery with checkpointing
- Messages: fault-tolerant predictable protocol
- At most k transient faults within each application run (system period)
[Figure: application graph with processes P1–P5 and messages m1, m2]

Scheduling with Fault Tolerance Requirements
- Conditional scheduling
- Shifting-based scheduling

Conditional Scheduling (example, k = 2)
- Two processes, P1 followed by P2, connected by message m1; time axis 0–200 ms
- P1 starts at 0 in all scenarios; in the fault-free scenario (condition "true") P2 starts at 40
- If P1 faults (condition F_P1), its re-execution starts at 45; after a second fault of P1, P2 is delayed to 90 and then 130
- If P2 itself faults, its re-executions are scheduled under conditions such as ¬F_P1 ∧ F_P2, with start times around 85/140 and 95/150
[Figure: step-by-step Gantt charts building the conditional schedule for all fault scenarios]
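To make the conditional schedule table concrete, here is a minimal Python sketch of one possible representation and run-time lookup, with start times loosely taken from the example above (k = 2). The table layout, the guard encoding, and the helper names are illustrative assumptions, not the data structure used in the thesis.

```python
# A sketch of a conditional schedule table and its run-time lookup.
# Guards map a process name to the number of its faults observed so far;
# an empty guard corresponds to the condition "true".

from typing import Dict, List, Tuple

Row = Tuple[str, Dict[str, int], int]  # (activity, guard, start time in ms)

TABLE: List[Row] = [
    ("P1",   {},        0),    # P1 always starts at 0
    ("P1/2", {"P1": 1}, 45),   # first re-execution of P1 (condition F_P1)
    ("P1/3", {"P1": 2}, 90),   # second re-execution (k = 2)
    ("P2",   {"P1": 0}, 40),   # fault-free scenario
    ("P2",   {"P1": 1}, 90),   # one fault of P1 delays P2
    ("P2",   {"P1": 2}, 130),  # two faults of P1 delay P2 further
]

def active_rows(fault_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (activity, start) pairs whose guard matches the observed faults."""
    return [(act, start) for act, guard, start in TABLE
            if all(fault_counts.get(p, 0) == n for p, n in guard.items())]

print(active_rows({"P1": 0}))  # [('P1', 0), ('P2', 40)]
print(active_rows({"P1": 1}))  # [('P1', 0), ('P1/2', 45), ('P2', 90)]
```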
Fault-Tolerance Conditional Process Graph
- All alternative execution scenarios for k faults are captured in one graph: re-executions of P1 and P2, with edges guarded by conditions such as F_P1 and ¬F_P1
- Conditional scheduling generates the schedule tables from this graph
[Figure: FT-CPG for the P1/P2 example, with message m1 replicated along the conditional branches]

Conditional Schedule Table
- One table per computation node and one for the bus
- Each row gives the start times of a process or message under a conjunction of fault conditions, e.g. P1 starts at 0 under "true" and at 45 under F_P1; m1 is sent at 40 in the fault-free scenario; P2 starts at 50 under "true" and later under the fault conditions
[Figure: conditional schedule table for the example, k = 2]

Conditional Scheduling: Properties
+ Generates short schedules
+ Allows trading off transparency for performance (discussed later)
– Requires a lot of memory to store the schedule tables
– The scheduling algorithm is very slow
Alternative: shifting-based scheduling

Shifting-Based Scheduling
- Messages sent over the bus are scheduled at a single, fixed time
- Faults on one computation node must not affect the other computation nodes
+ Requires less memory
+ Schedule generation is very fast
– Schedules are longer
– Does not allow trading off transparency for performance (discussed later)

Ordered FT-CPG
- Processes mapped to the same node are totally ordered (e.g. P2 after P1, P3 after P4), which removes conditional branching from the graph
[Figure: ordered FT-CPG for processes P1–P4 and messages m1–m3]

Root Schedules
- Processes are scheduled back to back, followed by a shared recovery slack that absorbs their re-executions; the slack is sized by the worst-case process of the group
[Figure: root schedule on N1, N2 and the bus, showing the recovery slack for P1 and P2 and the worst-case scenario for P1]

Extracting Execution Scenarios
- At run time, re-executions (e.g. P4/1, P4/2, P4/3) shift subsequent processes into the recovery slack, while the structure of the root schedule is preserved
[Figure: execution scenario with two faults of P4 extracted from the root schedule]

Memory Required to Store Schedule Tables
Memory in kB for conditional scheduling, by the share of frozen processes/messages:

Frozen   20 processes        40 processes        60 processes         80 processes
         k=1   k=2   k=3     k=1   k=2    k=3    k=1   k=2    k=3     k=1   k=2    k=3
100%     0.13  0.28  0.54    0.36  0.89   1.73   0.71  2.09   4.35    1.18  4.21   8.75
 75%     0.22  0.57  1.37    0.62  2.06   4.96   1.20  4.64  11.55    2.01  8.40  21.11
 50%     0.28  0.82  1.94    0.82  3.11   8.09   1.53  7.09  18.28    2.59 12.21  34.46
 25%     0.34  1.17  2.95    1.03  4.34  12.56   1.92 10.00  28.31    3.05 17.30  51.30
  0%     0.39  1.42  3.74    1.17  5.61  16.72   2.16 11.72  34.62    3.41 19.28  61.85

- Applications with more frozen nodes require less memory

Memory Required to Store the Root Schedule
- About 0.016 kB for 20 processes, 0.034 kB for 40, 0.054 kB for 60, and 0.070 kB for 80, independent of the number of faults k
- Shifting-based scheduling requires very little memory

Schedule Generation Time and Quality
- Shifting-based scheduling needs 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults
- Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults
- Shifting-based scheduling is thus much faster than conditional scheduling
- Its fault tolerance overhead is about 15% worse than that of conditional scheduling with 100% of inter-processor messages frozen
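The root-schedule idea above can be sketched in a few lines: processes on a node are placed back to back, and a single shared recovery slack is appended, sized to absorb k re-executions of the longest process in the group plus recovery overhead. This is a simplified sketch under that assumption; the function names and the example values are illustrative.

```python
# A sketch of root-schedule construction with a shared recovery slack,
# as used by shifting-based scheduling. All names and values are illustrative.

from typing import Dict, List, Tuple

def recovery_slack(wcets: List[int], k: int, mu: int) -> int:
    """Slack big enough to absorb k re-executions of the longest process."""
    return k * (max(wcets) + mu)

def root_schedule(procs: List[Tuple[str, int]], k: int, mu: int) -> Tuple[Dict[str, int], int]:
    """Place processes back to back, then append one shared slack block."""
    t, starts = 0, {}
    for name, wcet in procs:
        starts[name] = t
        t += wcet
    return starts, t + recovery_slack([w for _, w in procs], k, mu)

# Example: P1 (30 ms) and P2 (20 ms) on N1, k = 2 faults, mu = 5 ms.
starts, end = root_schedule([("P1", 30), ("P2", 20)], k=2, mu=5)
print(starts, end)  # {'P1': 0, 'P2': 30} 120  (50 ms of work + 2 * (30 + 5) slack)
```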
Thesis Contributions: Fault Tolerance Policy Assignment and Checkpoint Optimization

Fault Tolerance Policy Assignment
- Re-execution: P1/1, P1/2, P1/3 on N1
- Replication: P1(1) on N1, P1(2) on N2, P1(3) on N3
- Re-executed replicas: P1(1)/1 and P1(1)/2 on N1, P1(2) on N2
[Figure: the three policies for process P1 with k = 2]

Re-execution vs. Replication
- Neither policy dominates: for application A1, re-execution meets the deadline while replication misses it; for application A2 (a chain P1 → m1 → P2 → m2 → P3), replication meets the deadline while re-execution misses it
- WCETs: P1 = 40/50 ms, P2 = 40/50 ms, P3 = 60/70 ms on N1/N2
[Figure: Gantt charts of both policies for A1 and A2 on nodes N1, N2 and the bus]

Fault Tolerance Policy Assignment: Optimization
- The policy must be chosen per process: a combined assignment of replication and re-execution can meet a deadline that either policy alone misses
- WCETs: P1 = 40/50, P2 = 60/80, P3 = 60/80, P4 = 40/50 ms on N1/N2
[Figure: Gantt charts comparing pure and mixed policy assignments for processes P1–P4 and messages m1–m3]

Optimization Strategy
- Design optimization: fault tolerance policy assignment and mapping of processes and messages, using tabu search
- Root schedules are generated with shifting-based scheduling
- Three tabu-search optimization algorithms:
  1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
  2. Mapping and only re-execution (MX)
  3. Mapping and only replication (MR)

Experimental Results
- Schedulability improvement under resource constraints, measured as the average % deviation from MRX
- MRX outperforms MX, which in turn outperforms MR; the gap grows with application size (20 to 100 processes)
[Figure: average % deviation from MRX (0–100%) versus the number of processes for MR, MX, and MRX]

Checkpoint Optimization
[Figure: re-execution of P1 versus rollback recovery with two checkpoints, P1/1 and P1/2, on N1]

Locally Optimal Number of Checkpoints
- Too few checkpoints make each re-executed segment long; too many accumulate checkpointing overhead
- Example: P1 with C1 = 50 ms and k = 2, with checkpointing, error-detection, and recovery overheads of 5, 10, and 15 ms; the worst-case execution length is minimized at three checkpoints
[Figure: worst-case schedules of P1 for one to five checkpoints]

Globally Optimal Number of Checkpoints
- The locally optimal number of checkpoints per process is not necessarily globally optimal for the whole application
- Example: P1 → m1 → P2 with C1 = 50 ms, C2 = 60 ms, k = 2, and per-process overheads of 10, 5, and 10 ms: the locally optimal distribution yields a schedule of 265 ms, while the globally optimal one yields 255 ms
[Figure: schedules (a) with locally optimal and (b) with globally optimal checkpoint distribution]
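The locally optimal checkpoint count can be illustrated with a small script. It assumes the usual rollback-recovery cost model: n checkpoints split C into n segments, each checkpoint costs the checkpointing plus error-detection overhead, and each of the k faults re-executes at most one segment plus recovery and detection overheads. Both this model and the mapping of the 5/10/15 ms overheads to chi/alpha/mu are assumptions; with the slide's P1 numbers the sketch reproduces three checkpoints as the local optimum.

```python
import math

# Worst-case execution length of one process under the assumed cost model:
# n checkpoints of cost (chi + alpha) each, plus k faults, each re-executing
# at most one segment of length C/n plus recovery (mu) and detection (alpha).
def worst_case_length(C: float, n: int, k: int, chi: float, alpha: float, mu: float) -> float:
    return C + n * (chi + alpha) + k * (C / n + mu + alpha)

def local_optimum(C, k, chi, alpha, mu, n_max=10):
    """Brute-force search for the checkpoint count minimizing the worst case."""
    return min(range(1, n_max + 1),
               key=lambda n: worst_case_length(C, n, k, chi, alpha, mu))

# P1 from the slide: C1 = 50 ms, k = 2, overheads 5/10/15 ms (assumed chi/alpha/mu).
C, k, chi, alpha, mu = 50.0, 2, 5.0, 10.0, 15.0
print(local_optimum(C, k, chi, alpha, mu))  # 3 checkpoints
print(math.sqrt(k * C / (chi + alpha)))     # ~2.58, the closed-form estimate
```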
Global Optimization vs. Local Optimization
- Does the optimization reduce the fault tolerance overhead on the schedule length?
- Global optimization of checkpoint distribution (MC) is compared against local optimization (MC0) for 4 nodes and 3 faults, on applications of 40 to 100 tasks; the y-axis shows the % deviation from MC0, i.e. how much smaller the fault tolerance overhead becomes
[Figure: % deviation from MC0 (0–40%) versus application size for MC]

Thesis Contributions: Trading Off Transparency for Performance and Mapping Optimization with Transparency

FT Implementations with Transparency
- Processes and messages are either regular or frozen
- Transparency is achieved with frozen processes and messages, which start at the same time in all fault scenarios
- Good for debugging and testing
[Figure: application graph with processes P1–P5 and messages m1, m2, some of them marked frozen]

No Transparency
- Without transparency, processes start at different times and messages are sent at different times depending on the fault scenario
- Example: P1 (30 ms) and P2 (20 ms) on N1, P3 (20 ms) and P4 (30 ms) on N2, messages m1–m3, k = 2, recovery overhead of 5 ms
[Figure: no-fault scenario versus the worst-case fault scenario; the deadline is met]

Customized Transparency
- Full transparency misses the deadline; no transparency meets it with the shortest schedule; customized transparency (freezing only selected processes and messages, e.g. m3 and P3) also meets the deadline while keeping predictability where it is needed
[Figure: schedules under no, full, and customized transparency]

Trading Off Transparency for Performance
How much longer is the schedule with fault tolerance? (% increase; four computation nodes; recovery time 5 ms; transparency = share of frozen processes/messages)

Processes   0% transparency   25%              50%              75%              100%
            k=1  k=2  k=3     k=1  k=2  k=3    k=1  k=2  k=3    k=1  k=2  k=3    k=1  k=2  k=3
20          24   44   63      32   60   92     39   74  115     48   83  133     48   86  139
40          17   29   43      20   40   58     28   49   72     34   60   90     39   66   97
60          12   24   34      13   30   43     19   39   58     28   54   79     32   58   86
80           8   16   22      10   18   29     14   27   39     24   41   66     27   43   73

- Trading transparency for performance is essential

Mapping with Transparency
- A mapping that is optimal without transparency may no longer be optimal once transparency is imposed
- Example: processes P1–P6 with messages m1–m4, k = 2, recovery overhead of 10 ms; WCETs on both N1 and N2: P1 = 30, P2 = 40, P3 = 50, P4 = 60, P5 = 40, P6 = 50 ms
[Figure: optimal mapping without transparency and its worst-case fault scenario; the deadline is met]

Mapping with Transparency (cont.)
- With transparency, the worst-case fault scenario of the previously "optimal" mapping misses the deadline, while a mapping optimized with transparency in mind meets it
[Figure: worst-case fault scenarios with transparency for the "optimal" mapping and for the optimized mapping]

Design Optimization
- Hill-climbing mapping optimization heuristic
- The schedule length of each candidate mapping is evaluated by either:
  1. Conditional scheduling (CS): slow
  2. Schedule length estimation (SE): fast

Experimental Results: Estimation Speed
- How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)?
- Setup: 4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead of 5 ms; run times in seconds:

              k = 2 faults    k = 3 faults    k = 4 faults
              SE     CS       SE     CS       SE     CS
20 processes  0.01   0.07     0.02   0.28     0.04   1.37
30 processes  0.13   0.39     0.19   2.93     0.26   31.50
40 processes  0.32   1.34     0.50   17.02    0.69   318.88

- SE is more than 400 times faster than CS (0.69 s versus 318.88 s for 40 processes and k = 4)
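A minimal sketch of the estimation-driven hill-climbing loop from the "Design Optimization" slide: candidate mappings are evaluated with a fast schedule-length estimate, and only improving moves are kept. The estimate below (the maximum accumulated WCET per node, ignoring precedence, messages, and fault-tolerance slack) is a hypothetical stand-in for the thesis' SE heuristic.

```python
import random

def estimate_length(mapping, wcet):
    """Hypothetical SE stand-in: maximum accumulated WCET over the nodes."""
    load = {}
    for proc, node in mapping.items():
        load[node] = load.get(node, 0) + wcet[proc][node]
    return max(load.values())

def hill_climb(procs, nodes, wcet, iters=1000, seed=0):
    """Keep a random remapping move only if it improves the estimate."""
    rng = random.Random(seed)
    mapping = {p: rng.choice(nodes) for p in procs}
    best = estimate_length(mapping, wcet)
    for _ in range(iters):
        proc, node = rng.choice(procs), rng.choice(nodes)
        previous = mapping[proc]
        mapping[proc] = node
        candidate = estimate_length(mapping, wcet)
        if candidate < best:
            best = candidate          # keep the improving move
        else:
            mapping[proc] = previous  # undo the move
    return mapping, best

# WCETs follow the "Mapping with Transparency" slides (equal on N1 and N2).
wcet = {"P1": {"N1": 30, "N2": 30}, "P2": {"N1": 40, "N2": 40},
        "P3": {"N1": 50, "N2": 50}, "P4": {"N1": 60, "N2": 60},
        "P5": {"N1": 40, "N2": 40}, "P6": {"N1": 50, "N2": 50}}
print(hill_climb(list(wcet), ["N1", "N2"], wcet))
```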
Experimental Results: Mapping with Transparency
- How much is the improvement when transparency is taken into account during mapping?
- Setup: 4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead of 5 ms; reduction of the fault-tolerant schedule length:

              k = 2 faults   k = 3 faults   k = 4 faults
20 processes  32.89%         32.20%         30.56%
30 processes  35.62%         31.68%         30.58%
40 processes  28.88%         28.11%         28.03%

- The schedule length of fault-tolerant applications is on average 31.68% shorter if transparency is considered during mapping

Outline (repeated; conclusions and future work follow)

Conclusions: Scheduling with Fault Tolerance Requirements
- Two novel scheduling techniques
- Handling of customized transparency requirements, trading off transparency for performance
- A fast scheduling alternative with low memory requirements for the schedules

Conclusions: Design Optimization with Fault Tolerance
- A policy assignment optimization strategy
- Estimation-driven mapping optimization that can handle customized transparency requirements
- Optimization of the number of checkpoints
- The approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller

Design Optimization of Embedded Systems with Fault Tolerance is Essential

Future Work
- Fault-tree analysis
- Probabilistic fault model
- Soft real-time