Reliability-Aware Energy Optimisation for Fault-Tolerant Embedded MP-SoCs
IMM / Informatics and Mathematical Modelling, eselab.imm.dtu.dk

Summary
- Design optimisation tool for distributed embedded real-time systems
- Decides mapping, fault-tolerance policy and fault-tolerant schedule
- Hard real-time, hard reliability goal
- Static schedule for processes and messages
- Fault-tolerance for k transient/soft faults
- Optimises for minimal energy consumption while considering the impact of lowering voltages on the probability of faults
- Constraint logic programming (CLP) based implementation

Fault-tolerance
- Faults are tolerated by using temporal or spatial redundancy, or a combination of the two
- Fault detection is done using well-known techniques such as timing and bit coding
[Figures: Re-execution, Replication and Passive Replication of process P1 on PE1 and PE2]

Fault-tolerant scheduling
Comparison of FT schemes (example with processes P1 to P5 and messages m1, m2 on PE1, PE2 and a bus; k=1):
- Fully Transparent Scheduling: 100% E0
- Slack Sharing Scheduling: 63% E0
- Conditional Scheduling: 38% E0
Example values per process (PE1 / PE2): P1 40 / 40, P2 40 / 40, P3 30 / 30, P4 40 / 40, P5 40 / 40
- Trade-off transparency for performance
- Performance, and hence the obtainable energy savings, are greatly increased
- More complex scheduling schemes yield more slack for energy management
- More complex schemes demand larger schedule tables to be stored in the processing elements, and more sophisticated online schedulers

Energy vs. Faults
- Recent research shows that the probability of transient/soft faults increases dramatically when the voltage of a circuit is decreased
- Many modern designs use dynamic voltage scaling (DVS) to minimise energy consumption
- Fault-tolerant systems that use power management techniques may prove to be fault-tolerant but unreliable, due to the increase in faults
- The relation between faults and voltage is given by [1]:

$$\lambda(f) = \lambda_0 \, 10^{\frac{d(1-f)}{1-f_{\min}}}$$

[Figure: Relation between voltage and failure rate]
[1] D. Zhu et al.: "Reliability-Aware Energy Management for Periodic Real-Time Tasks", 2007

Energy vs. reliability
- The use of DVS increases the probability of faults, thus damaging the system reliability
- Considering reliability in the optimisation process allows for finding the minimum energy schedule that meets the reliability goal
- Reliability is imposed as a constraint: the reliability goal (RG) is a hard constraint
- Considering the reliability while optimising enables us to find reliable schedules with comparable energy savings

Example with processes P1 to P6 on PE1, PE2 and a bus; k=1, RG = 0.999 999 900:

Schedule                              R               Energy
Straightforward (SS)                  0.999 999 987   100% E0
Energy optimisation (EO)              0.999 999 878   68% E0
Reliable energy optimisation (REO)    0.999 999 920   73% E0

Example values per process (PE1 / PE2; X as in the original table): P1 10 / X, P2 60 / X, P3 X / 40, P4 40 / X, P5 X / 40, P6 X / 50
Voltage levels: PE1 100, 66, 33; PE2 100, 66, 33

Reliable energy management
- System reliability is affected by the use of energy management
- Reliability must be considered in the optimisation process
- Reliability can be met at very little energy cost
[Figures: Comparison of energy savings; Comparison of system reliability]
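
The relation between voltage and failure rate cited from Zhu et al. can be made concrete with a small sketch. This is an illustration under stated assumptions, not the poster's CLP-based tool: the parameter values `lam0`, `d` and `f_min` are hypothetical, faults are assumed to arrive as a Poisson process, execution time is assumed to stretch as 1/f, and energy is assumed to scale with f squared (voltage proportional to frequency). The helper `cheapest_reliable_level` is likewise a made-up name; it only shows how a reliability goal acts as a hard constraint when choosing among discrete voltage levels for a single process.

```python
import math


def failure_rate(f, lam0=1e-6, d=2.0, f_min=0.33):
    """lambda(f) = lam0 * 10**(d*(1-f)/(1-f_min)).

    f is the normalised frequency in [f_min, 1]; lam0, d and f_min are
    hypothetical values chosen for illustration only.
    """
    if not f_min <= f <= 1.0:
        raise ValueError("f must lie in [f_min, 1]")
    return lam0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))


def process_reliability(exec_time, f, **model):
    """Probability of no fault during one execution at level f.

    Assumes Poisson fault arrivals and that the execution time at full
    speed (exec_time) stretches to exec_time / f when scaled down.
    """
    return math.exp(-failure_rate(f, **model) * exec_time / f)


def relative_energy(f):
    """Energy relative to full speed, assuming voltage scales with f (E ~ f**2)."""
    return f ** 2


def cheapest_reliable_level(exec_time, levels, goal, **model):
    """Lowest-energy level whose reliability meets the goal, or None."""
    feasible = [f for f in levels
                if process_reliability(exec_time, f, **model) >= goal]
    return min(feasible, key=relative_energy) if feasible else None


if __name__ == "__main__":
    levels = (1.0, 0.66, 0.33)
    for goal in (0.98, 0.999, 0.9999):
        print(goal, cheapest_reliable_level(40, levels, goal))
```

With these made-up parameters, a loose goal admits the lowest level, while tighter goals force the process back towards full speed. This is the same tension the EO and REO schedules illustrate at system level.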