Design Optimization of Time- and Cost-Constrained Fault-Tolerant Distributed Embedded Systems
Viaceslav Izosimov, Paul Pop, Petru Eles, Zebo Peng
Embedded Systems Lab (ESLAB), Linköping University, Sweden

Motivation
  Hard real-time applications
    Timing constraints
    Cost constraints
  Faults
    Permanent
    Transient
    Intermittent
  Hardware solutions vs. software solutions
    Hardware solutions: MARS, TTA, X-by-Wire
      Permanent faults
      Costly for transient faults
    Software solutions
      Re-execution/rollback recovery
      Checkpointing/rollback recovery
      Replication, primary-backup…
  Online preemptive vs. off-line non-preemptive
    Online preemptive: flexible
    Off-line non-preemptive: predictable

Outline
  Motivation
  System architecture and fault-model
  Fault-tolerance techniques
  Problem formulation
  Motivational examples
  Tabu-search optimization strategy
  Experimental results
  Contributions and Message

Fault-Tolerant Time-Triggered Systems
  Transient faults ...
  Processes: static cyclic scheduling; re-execution and replication
  Messages: static schedule table; fault-tolerant protocol
  Time Triggered Protocol (TTP)
    Bus access scheme: time-division multiple-access (TDMA)
    Schedule table located in each TTP controller: message descriptor list (MEDL)
  [Figure: bus cycle of two TDMA rounds; each round consists of the slots S1 to S4]

Fault-Tolerant Techniques
  Re-execution
  Replication
  Re-executed replicas
  [Figure: executions of process P1 on nodes N1, N2 and N3 under each technique]

Problem Formulation
  Given
    Fault model: number of transient faults in the system period
    System architecture
    Application: WCETs, message sizes, periods, deadlines
  Determine
    Schedulable and fault-tolerant design implementation ...
      Fault-tolerance policy assignment
      Mapping of processes and messages
      Schedule tables for processes and messages
  [Figure labels: fault model: transient faults; application: set of process graphs; architecture: time-triggered system]

Static Scheduling [Kandasamy et al. 03]
  Transparent re-execution
  Recovery slack
  Root schedules
  Contingency schedules
  [Figure: schedule of processes P1 to P5 and messages m1, m2 on nodes N1 to N3 with recovery slack, and a tree of root and contingency schedules S1 to S18]

Re-execution vs. Replication
  [Figure: two example applications, A1 and A2, each with processes P1 to P3 and messages m1, m2, scheduled on nodes N1 and N2 over a TTP bus with slots S1 and S2. The schedules are marked Met or Missed against the deadline; one example is labelled "Re-execution is better", the other "Replication is better".]
  Worst-case execution times:
          N1   N2
    P1    40   50
    P2    40   50
    P3    60   70

Fault-Tolerant Policy Assignment
  No fault-tolerance: application crashes
  Optimization of fault-tolerance policy assignment
  [Figure: application with processes P1 to P4 and messages m1 to m3 on nodes N1 and N2 over a TTP bus with slots S1 and S2; the alternative schedules are marked Met or Missed against the deadline]
  Worst-case execution times:
          N1   N2
    P1    40   50
    P2    60   80
    P3    60   80
    P4    40   50

Mapping and Fault-Tolerance
  Best mapping without considering fault-tolerance
  Simultaneous mapping and fault-tolerance
  [Figure: application with processes P1 to P4 and messages m1 to m4 on nodes N1 and N2 over a TTP bus with slots S1 and S2; one schedule is marked Missed and the other Met against the deadline]
  Worst-case execution times:
          N1   N2
    P1    40   X
    P2    60   70
    P3    60   70
    P4    40   X

Optimization Strategy
  Design optimization:
    Fault-tolerance policy assignment and mapping of processes and messages: tabu-search
    Root schedules: list scheduling
  Three tabu-search optimization algorithms:
    1. Mapping and Fault-Tolerance Policy assignment (MRX): re-execution, replication or both
    2. Mapping and only Re-Execution (MX)
    3. Mapping and only Replication (MR)

MRX Tabu-Search Example
  Current solution
  Design transformations
  [Figure: current solution and candidate design transformations for processes P1 to P4 on nodes N1 and N2, with Tabu and Wait counters kept per process. Candidate moves are labelled by tabu status (tabu or non-tabu) and by whether they are better or worse than the best-so-far solution.]
  Worst-case execution times:
          N1   N2
    P1    40   50
    P2    60   75
    P3    60   75
    P4    40   50

Experimental Results
  Schedulability improvement under resource constraints
  [Plot: average % deviation from MRX (0 to 100) against number of processes (20 to 100), with curves for mapping and replication (MR), mapping and re-execution (MX), and mapping and policy assignment (MRX)]
  Case study: vehicle cruise controller
    MRX: schedulable fault-tolerant application with 65% overhead

Contributions and Message
  Contributions
    Combined re-execution and replication
    Optimization algorithms for fault-tolerance policy assignment
    Efficient contingency schedule generation
  Message
    Optimization of fault-tolerance policy assignment needed for cost-effective fault tolerance
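
As a rough illustration of the recovery slack and replication ideas from the Static Scheduling and Re-execution vs. Replication slides, the sketch below compares two worst-case processor-time budgets using the execution times of P1 to P3 from the Re-execution vs. Replication table. It is a simplified calculation, not the scheduling algorithm of the talk. The assumptions are ours: bus delays and recovery overhead are ignored, the number of tolerated faults k is taken to be 1, processes on a node share a single recovery slack sized for the longest process, and the function names are hypothetical.

```python
# Simplified sketch under stated assumptions; not the authors' scheduler.
# Assumptions (not from the slides): no bus delay, no recovery overhead,
# k = 1 transient fault, one shared recovery slack per node.

# Worst-case execution times from the "Re-execution vs. Replication" table.
WCET = {
    "P1": {"N1": 40, "N2": 50},
    "P2": {"N1": 40, "N2": 50},
    "P3": {"N1": 60, "N2": 70},
}


def reexecution_length(processes, node, k):
    """Worst-case length when all processes run back to back on one node
    and up to k faults are handled by re-execution in a shared slack."""
    times = [WCET[p][node] for p in processes]
    # The shared slack must fit k re-executions of the longest process.
    return sum(times) + k * max(times)


def replication_length(processes, nodes):
    """Length when every process has one replica on each listed node.
    No time redundancy is needed, so the slowest node decides."""
    return max(sum(WCET[p][n] for p in processes) for n in nodes)


if __name__ == "__main__":
    procs = ["P1", "P2", "P3"]
    print(reexecution_length(procs, "N1", k=1))      # 200
    print(replication_length(procs, ["N1", "N2"]))   # 170
```

Under these assumptions replication finishes earlier but occupies both nodes, while re-execution uses only N1. Once message transmission over the TDMA bus and the process graph structure are taken into account, either policy can come out ahead, which is the point the two example applications on the slide make.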