Scheduling and Optimization of Fault-Tolerant Embedded Systems, Viacheslav Izosimov

Presentation of Licentiate Thesis
Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov
Embedded Systems Lab (ESLAB)
Linköping University, Sweden
Motivation
- Hard real-time applications
  - Time-constrained
  - Cost-constrained
  - Fault-tolerant
  - etc.
- Focus on transient faults and intermittent faults
Motivation
Transient faults
- Happen for a short time
- Corrupt data, cause miscalculations in logic
- Do not cause permanent damage to circuits
- Causes are outside system boundaries: electromagnetic interference (EMI), radiation, lightning storms
Motivation
Intermittent faults
- Manifest similarly to transient faults
- Happen repeatedly
- Causes are inside system boundaries: crosstalk, internal EMI, power supply fluctuations, software errors (Heisenbugs)
Motivation
- Transient faults are becoming more likely to occur as transistor sizes shrink and frequencies grow
- Errors caused by transient faults have to be tolerated before they crash the system
- However, tolerating transient faults leads to significant performance overhead
Motivation
- Hard real-time applications
  - Time-constrained
  - Cost-constrained
  - Fault-tolerant
  - etc.
The Need for Design Optimization of Embedded Systems with Fault Tolerance
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
General Design Flow
1. System Specification
2. Architecture Selection
3. Mapping & Hardware/Software Partitioning
4. Scheduling
5. Back-end Synthesis
Fault tolerance techniques are considered at the mapping and scheduling steps; feedback loops connect the stages.
Fault Tolerance Techniques
- Re-execution: a failed process is executed again on the same node (P1/1, P1/2); incurs an error-detection overhead and a recovery overhead.
- Rollback recovery with checkpointing: a failed process is restarted from its last checkpoint; adds a checkpointing overhead per checkpoint.
- Active replication: replicas of a process run on different nodes, e.g. P1(1) on N1 and P1(2) on N2.
[Figure: timelines (0-60 ms) illustrating each technique for process P1.]
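To make the overheads concrete, the worst-case completion time under the first two techniques can be sketched as below. This is an illustrative model, not the thesis's exact formulation; the parameter names (C for worst-case execution time, k tolerated faults, error-detection overhead alpha, recovery overhead mu, checkpointing overhead chi) follow the slide's symbols but the formulas are assumptions.

```python
def wc_reexecution(C, k, alpha, mu):
    # Worst case with re-execution: the whole process (plus error
    # detection) runs once, then is re-executed after each of the
    # k faults, paying the recovery overhead mu every time.
    return C + alpha + k * (C + alpha + mu)

def wc_checkpointing(C, n, k, alpha, mu, chi):
    # Worst case with n checkpoints: every checkpoint costs
    # chi + alpha, but each fault only repeats the longest
    # execution segment of length C / n.
    return C + n * (alpha + chi) + k * (C / n + alpha + mu)

# Example: a 50 ms process tolerating k = 2 faults
# (alpha = 5 ms, mu = 5 ms, chi = 10 ms):
print(wc_reexecution(50, 2, 5, 5))           # 175
print(wc_checkpointing(50, 2, 2, 5, 5, 10))  # 150.0
```

Checkpointing pays a fixed cost per checkpoint but shrinks the amount of work repeated per fault, which is why it can beat plain re-execution once C is large enough.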
Limitations of Previous Work
- Design optimization with fault tolerance is limited
- Process mapping is not considered together with fault tolerance issues
- Multiple faults are not addressed in the framework of static cyclic scheduling
- Transparency, if at all addressed, is restricted to a whole computation node
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Fault-Tolerant Time-Triggered Systems
- Transient faults: at most k within each application run (system period)
- Processes: re-execution, active replication, rollback recovery with checkpointing
- Messages: fault-tolerant predictable protocol
[Figure: application graph with processes P1-P5 and messages m1, m2.]
Scheduling with Fault Tolerance Requirements
- Conditional Scheduling
- Shifting-based Scheduling
Conditional Scheduling
An example with k = 2 faults, processes P1 and P2, and message m1. The schedule table is built step by step: each possible fault occurrence introduces a condition (FP1, FP2), and every process receives a start time for each combination of conditions. For instance, P1 starts at 0; P2 starts at 40 under "true" (no fault), at 90 after one fault of P1, and at 130 after two faults of P1; further rows cover the scenarios in which P2 itself fails.
[Figure: step-by-step construction of the conditional schedule on a 0-200 ms time axis, covering zero, one and two faults of P1 and P2.]
Fault-Tolerance Conditional Process Graph
[Figure: FT-CPG for the example with k = 2; P1 is expanded into three conditional copies, P2 into six, and m1 into three conditional messages, with edges guarded by the fault conditions FP1 and FP2.]
Conditional Schedule Table
For the example (k = 2; P1 and m1 scheduled on N1, P2 on N2), the table lists one start time per process and message for each condition: P1 starts at 0; m1 is sent at 40 under "true" and later under fault conditions; P2 starts at 50 under "true", with later start times (up to 160 ms) under FP1, FP2 and their conjunctions.
[Figure: the full conditional schedule table for the example.]
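Conceptually, each row of the table maps the fault events observed so far to a start time. A minimal sketch, using the P2 start times of the earlier Gantt-chart example (40, 90, 130 for zero, one and two faults of P1); the dictionary representation is illustrative, not the thesis's data structure.

```python
# Rows of a conditional schedule table for one process (P2).
# Keys are the sequence of fault events observed at run time;
# values are start times taken from the k = 2 example.
SCHEDULE_P2 = {
    (): 40,                 # condition "true": no fault observed
    ("FP1",): 90,           # one fault during P1
    ("FP1", "FP1"): 130,    # two faults during P1
}

def start_time(table, observed):
    # Online use: the scheduler tracks fault occurrences and
    # activates the matching row of the table.
    return table[tuple(observed)]

print(start_time(SCHEDULE_P2, []))       # 40
print(start_time(SCHEDULE_P2, ["FP1"]))  # 90
```

The memory cost discussed on the next slides comes from the number of rows: with more processes and more tolerated faults, the set of condition combinations grows quickly.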
Conditional Scheduling
- Conditional scheduling:
  + Generates short schedules
  + Allows trading off transparency for performance (discussed later)
  - Requires a lot of memory to store schedule tables
  - Scheduling algorithm is very slow
- Alternative: shifting-based scheduling
Shifting-based Scheduling
- Messages sent over the bus are scheduled at a single, fixed time
- Faults on one computation node must not affect other computation nodes
+ Requires less memory
+ Schedule generation is very fast
- Schedules are longer
- Does not allow trading off transparency for performance (discussed later)
Ordered FT-CPG
[Figure: FT-CPG for k = 2 with process executions ordered on each node (P2 after P1, P3 after P4) and synchronization nodes for the messages m1, m2, m3.]
Root Schedules
[Figure: root schedule on N1, N2 and the bus; recovery slack is reserved after P1 and P2, sized for the worst-case scenario of P1.]
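In shifting-based scheduling, the processes on a node share a recovery slack. A minimal sizing sketch under an assumed model: the slack must cover k recoveries of the "worst" process in the group, each costing that process's WCET plus the recovery overhead mu.

```python
def shared_recovery_slack(wcets, k, mu):
    # The slack shared by a group of processes on one node must be
    # large enough for the worst case: k consecutive re-executions
    # of the longest process, each paying the recovery overhead mu.
    return k * (max(wcets) + mu)

# P1 (40 ms) and P2 (60 ms) sharing one slack, k = 2, mu = 5 ms:
print(shared_recovery_slack([40, 60], 2, 5))  # 130
```

Because the slack is shared rather than per-process, only one compact root schedule needs to be stored, which is where the memory advantage on the following slides comes from.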
Extracting Execution Scenarios
[Figure: execution scenario extracted from the root schedule, in which P4 is executed three times (P4/1, P4/2, P4/3) inside the reserved recovery slack.]
Memory Required to Store Schedule Tables
[Table: memory needed to store the conditional schedule tables for 20-80 processes, k = 1..3, with 0%-100% of inter-processor messages frozen; memory grows with the number of processes, the number of faults, and the fraction of non-frozen messages (from 0.13 for 20 processes, k = 1, 100% frozen, up to 61.85 for 80 processes, k = 3, 0% frozen).]
- Applications with more frozen nodes require less memory
Memory Required to Store Root Schedule
[Table: memory needed to store the root schedule for 20-80 processes, k = 1..3; all values are below 0.07, versus 1.73 for the corresponding conditional schedule table.]
- Shifting-based scheduling requires very little memory
Schedule Generation Time and Quality
- Shifting-based scheduling needs 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults
- Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults
- Shifting-based scheduling is much faster than conditional scheduling
- It is ~15% worse than conditional scheduling with 100% of inter-processor messages set to frozen (in terms of fault tolerance overhead)
Fault Tolerance Policy Assignment
Checkpoint Optimization
Fault Tolerance Policy Assignment
[Figure: three ways to tolerate k = 2 faults of P1: re-execution (P1/1, P1/2, P1/3 on N1), replication (P1(1) on N1, P1(2) on N2, P1(3) on N3), and re-executed replicas (P1(1)/1 and P1(1)/2 on N1, P1(2) on N2).]
Re-execution vs. Replication
[Figure: two applications A1 and A2 mapped on N1, N2 (worst-case execution times: P1 40/50, P2 40/50, P3 60/70; k = 1). For A1, re-execution meets the deadline while replication misses it; for A2, replication meets the deadline while re-execution misses it.]
Fault Tolerance Policy Assignment
[Figure: application with P1-P4 (worst-case execution times on N1/N2: P1 40/50, P2 60/80, P3 60/80, P4 40/50; k = 1). Pure re-execution and pure replication both miss the deadline; optimization of the fault tolerance policy assignment, combining the two techniques, meets it.]
Optimization Strategy
- Design optimization:
  - Fault tolerance policy assignment
  - Mapping of processes and messages
- Tabu search over candidate solutions; root schedules are generated with shifting-based scheduling
- Three tabu-search optimization algorithms:
  1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
  2. Mapping and only re-execution (MX)
  3. Mapping and only replication (MR)
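The tabu-search loop behind MRX/MX/MR can be sketched as below. The neighborhood, cost function and toy example (three processes, two nodes, most-loaded-node cost) are illustrative placeholders, not the thesis's implementation, which evaluates candidate mappings with shifting-based scheduling.

```python
def tabu_search(initial, neighbours, cost, iterations=50, tabu_len=5):
    # Generic tabu search: move to the best non-tabu neighbour, and
    # remember recently visited solutions to escape local minima.
    best = current = initial
    tabu = [initial]
    for _ in range(iterations):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)
        tabu.append(current)
        if len(tabu) > tabu_len:
            tabu.pop(0)
        if cost(current) < cost(best):
            best = current
    return best

# Toy mapping problem: place 3 processes (WCETs 40, 60, 50) on two
# nodes, minimising the load of the most loaded node.
WCET = [40, 60, 50]

def neighbours(mapping):
    # Moves: re-map a single process to the other node.
    return [tuple(1 - n if i == j else n for j, n in enumerate(mapping))
            for i in range(len(mapping))]

def cost(mapping):
    return max(sum(w for w, n in zip(WCET, mapping) if n == node)
               for node in (0, 1))

best = tabu_search((0, 0, 0), neighbours, cost)
print(cost(best))  # 90
```

In the real algorithms a move would also include changing a process's fault tolerance policy (re-execution vs. replication), and the cost would be the length of the resulting root schedule.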
Experimental Results
Schedulability improvement under resource constraints
[Figure: average % deviation from MRX (0-100%) versus number of processes (20-100). Mapping and replication (MR) deviates the most, mapping and re-execution (MX) considerably less, and mapping and policy assignment (MRX) is the 0% baseline.]
Checkpoint Optimization
[Figure: process P1 on N1 with one or two checkpoints; with two checkpoints, a fault re-executes only the affected segment (P1/1 or P1/2).]
Locally Optimal Number of Checkpoints
[Figure: worst-case length of P1 (C1 = 50 ms, k = 2) with 1 to 5 checkpoints, for checkpointing overheads χ1 = 5, 10 and 15 ms; the locally optimal number of checkpoints decreases as the overhead grows.]
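The locally optimal number of checkpoints can be found by minimising the worst-case length over n. The overhead model below (per-checkpoint overhead o, recovery overhead mu) is an illustrative assumption; the thesis derives the optimum analytically, here it is found by brute force.

```python
def worst_case_length(C, n, k, o, mu):
    # n checkpoints cost n * o in total; each of the k faults
    # re-executes at most one segment of length C / n plus the
    # recovery overhead mu. (Illustrative overhead model.)
    return C + n * o + k * (C / n + mu)

def optimal_checkpoints(C, k, o, mu, n_max=20):
    # Brute-force search for the n minimising the worst case.
    return min(range(1, n_max + 1),
               key=lambda n: worst_case_length(C, n, k, o, mu))

# C1 = 50 ms, k = 2, mu = 5 ms: the optimum drops as the
# per-checkpoint overhead grows, matching the slide's trend.
print(optimal_checkpoints(50, 2, 10, 5))  # 3
print(optimal_checkpoints(50, 2, 30, 5))  # 2
```

The trade-off is visible in the two terms: more checkpoints mean more fixed overhead (n * o) but shorter re-executed segments (C / n).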
Globally Optimal Number of Checkpoints
[Figure: two checkpoint distributions for P1 and P2 (C1 = 50 ms, C2 = 60 ms, k = 2, overheads 10/5/10 ms): (a) the locally optimal numbers of checkpoints yield a schedule length of 265 ms, while (b) the globally optimal distribution yields 255 ms.]
Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overheads on the schedule length?
[Figure: % deviation from MC0 (i.e., how much smaller the fault tolerance overhead is) versus application size (40-100 tasks), for 4 nodes and 3 faults; global optimization of checkpoint distribution (MC) clearly outperforms local optimization (MC0).]
Trading-off Transparency for Performance
Mapping Optimization with Transparency
FT Implementations with Transparency
- Transparency is achieved with frozen processes and messages
- Good for debugging and testing
[Figure: application graph with processes P1-P5 and messages m1, m2, distinguishing regular processes/messages from frozen ones.]
No Transparency
[Figure: no-fault scenario and worst-case fault scenario (k = 2, recovery overhead 5 ms) for P1-P4 on N1, N2. Without transparency, processes start at different times and messages are sent at different times in different scenarios; the worst-case scenario still meets the deadline.]
Full Transparency vs. Customized Transparency
[Figure: schedules under no transparency, full transparency, and customized transparency. Full transparency produces the longest worst-case schedule and misses the deadline; no transparency and customized transparency meet it.]
Trading-Off Transparency for Performance
How much longer is the schedule length with fault tolerance? (Four computation nodes, recovery time 5 ms.)
[Table: % increase of the schedule length for 20-80 processes and k = 1..3, as transparency increases from 0% to 100%; the overhead grows with the degree of transparency.]
- Trading transparency for performance is essential
Mapping with Transparency
[Figure: application of P1-P6 (worst-case execution times 30-60 ms on N1/N2, recovery overhead 10 ms, k = 2); the mapping that is optimal without transparency, and the worst-case fault scenario for that mapping, which meets the deadline.]
Mapping with Transparency
[Figure: with transparency requirements, the worst-case fault scenario for the previously "optimal" mapping misses the deadline; with transparency considered during mapping optimization, the worst-case scenario meets it.]
Design Optimization
- Hill-climbing mapping optimization heuristic
- Schedule length is evaluated with:
  1. Conditional Scheduling (CS): slow
  2. Schedule Length Estimation (SE): fast
Experimental Results
How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)? (4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms.)
[Table: run times in seconds for 20-40 processes and k = 2..4 faults; e.g. for 40 processes and k = 4 faults, SE takes 0.69 s while CS takes 318.88 s.]
- Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS)
Experimental Results
How much is the improvement when transparency is taken into account? (4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms.)
[Table: improvement of the schedule length for 20-40 processes and k = 2..4 faults, ranging from about 28% to 36%.]
- The schedule length of fault-tolerant applications is on average 31.68% shorter if transparency is considered during mapping
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Conclusions
- Scheduling with fault tolerance requirements
  - Two novel scheduling techniques
  - Handling customized transparency requirements, trading off transparency for performance
  - Fast scheduling alternative with low memory requirements for schedules
Conclusions
- Design optimization with fault tolerance
  - Policy assignment optimization strategy
  - Estimation-driven mapping optimization that can handle customized transparency requirements
  - Optimization of the number of checkpoints
- Approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller
Design Optimization of Embedded Systems with Fault Tolerance is Essential
Future Work
- Fault-tree analysis
- Probabilistic fault model
- Soft real-time