Cellular Phones as Embedded Systems

Pipelines for Future Architectures in
Time Critical Embedded Systems
By: R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand
EEL 6935 - Embedded Systems
Dept. of Electrical and Computer Engineering
University of Florida
Liza Rodriguez
Aurelio Morales
Outline
• Pipelining Review
• Timing Analysis
• Anomalies
• Domino Effects
• Architecture Classifications
• Conclusions
Pipelining Review
• Pipelining is an implementation technique where
multiple instructions are overlapped in execution
• Pipelining takes advantage of parallelism that exists
among the actions needed to execute an instruction
• Pipelining is like an assembly line, each stage operates in parallel with
the other stages
• Instructions enter at one end, progress through the stages, and exit at
the other end
• Pipelining is the key implementation technique used to
make fast CPUs
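As a rough illustration of the overlap described above, assume an idealized pipeline with k stages, one cycle per stage, and no stalls. The symbols n, k, and T are introduced here for illustration and do not come from the slides. Executing n instructions then takes:

```latex
T_{\text{sequential}} = n \cdot k, \qquad
T_{\text{pipelined}} = k + (n - 1), \qquad
\frac{T_{\text{sequential}}}{T_{\text{pipelined}}} = \frac{n \cdot k}{k + n - 1}
\xrightarrow{\; n \to \infty \;} k
```

For example, with k = 5 and n = 3 this gives 15 cycles without overlap and 7 cycles with it. Hazards add stall cycles on top of the pipelined figure.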
Pipelined Example
• Pipeline registers separate functional units to allow parallel operation
• Pipeline will stall if there is a hazard
[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, Write Back) executing the instruction sequence below]
LD r4, 0(r3): 5 cycles (5)
ADD r1, r7, r3: 1 cycle (4)
ADD r2, r6, r30: 1 cycle (4)
Further Optimizations
• Superscalar – executes more than one instruction per
clock cycle by simultaneously dispatching multiple
instructions to redundant functional units
[Figure: two parallel pipelines, each with the stages Fetch, Decode, Execute, Memory, Write Back]
• Branch Prediction – predict branches based on a predefined
static algorithm or based on dynamic branch history
• Out of order execution – instructions are dynamically
scheduled to avoid hazards and dependencies that may
stall the pipeline
[Figure: reservation stations feeding the Execute and Memory functional units. Station entries: ADD r1, r2, r3 (wait); LD r4, (0) r5 (wait); SUB r1, r2, r3 (wait); ST r2, (0) r1 (ready); MUL r6, r7, r8 (ready); LD r4, (0) r1 (wait)]
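The slides do not name a specific dynamic prediction scheme. As one common illustration, the sketch below uses a 2-bit saturating counter next to a static "always taken" rule. The class and function names are hypothetical and chosen only for this example.

```python
from typing import List


class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state: int = 1) -> None:
        self.state = state  # assumed initial state: weakly not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def dynamic_mispredictions(outcomes: List[bool], predictor: TwoBitPredictor) -> int:
    """Count mispredictions of the counter over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


def static_taken_mispredictions(outcomes: List[bool]) -> int:
    """Static rule (assumed here): always predict taken."""
    return sum(1 for taken in outcomes if not taken)


# A loop branch that is taken nine times and then falls through.
loop_branch = [True] * 9 + [False]
print(dynamic_mispredictions(loop_branch, TwoBitPredictor()))  # 2
print(static_taken_mispredictions(loop_branch))                # 1
```

The counter value is part of the hardware state, so a timing analysis has to track or safely approximate it to know which branches are mispredicted.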
Real Time Embedded Systems
• Timing Analysis
• The analysis of a set of tasks executing on given hardware to
guarantee that timing constraints will be met
• Timing analysis requires known upper and lower bounds on the
execution times of tasks:
• Worst Case Execution Time (WCET), Best Case Execution Time (BCET)
• Analysis results are highly dependent on the architecture
• An architecture without accompanying performance analysis
technology should not be seriously considered for time critical
embedded applications
• Desired Criteria
• Soundness – valid, reliable, free from random error
• Obtainable Precision – architecture has predictability properties
• Analysis effort to reach precision – depends on solution space to be
explored
Timing Analysis
• Non-Pipelined Architecture – Simple
• Add the execution times of individual instructions to obtain a bound
on the execution time of a basic block
• Pipelined Architecture – Complex
• Overlapped instructions - cannot consider individual instructions in
isolation
• Instructions must be considered collectively to obtain timing bounds
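To make the contrast concrete, here is a minimal sketch. It assumes a hypothetical five-stage in-order pipeline with one cycle per stage and a one-cycle load-use stall. The function names, the tuple layout, and all latencies are illustrative and not taken from the slides.

```python
from typing import List, Tuple

# (opcode, destination register, source registers): layout chosen for this sketch
Instr = Tuple[str, str, Tuple[str, ...]]


def non_pipelined_bound(latencies: List[int]) -> int:
    """Without overlap, a basic block costs the sum of its instruction times."""
    return sum(latencies)


def pipelined_cycles(block: List[Instr], stages: int = 5, load_use_stall: int = 1) -> int:
    """Cycles for a block in an idealized in-order pipeline (assumed model).

    The first instruction needs `stages` cycles. Each later instruction
    completes one cycle after its predecessor, plus a stall if it reads the
    register written by the immediately preceding load.
    """
    if not block:
        return 0
    cycles = stages
    for prev, cur in zip(block, block[1:]):
        cycles += 1
        if prev[0] == "LD" and prev[1] in cur[2]:
            cycles += load_use_stall
    return cycles


independent = [("LD", "r4", ("r3",)), ("ADD", "r1", ("r7", "r3")), ("ADD", "r2", ("r6", "r30"))]
dependent = [("LD", "r4", ("r3",)), ("ADD", "r5", ("r4", "r4")), ("ADD", "r2", ("r6", "r30"))]

print(non_pipelined_bound([5, 5, 5]))  # 15
print(pipelined_cycles(independent))   # 7
print(pipelined_cycles(dependent))     # 8
```

The first ADD costs one cycle more in `dependent` than in `independent` only because of the instruction before it, which is why per-instruction times cannot simply be added up for a pipeline.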
Timing Analysis
• Pipelined Architecture – Complex
• To do WCET analysis, the most costly pipeline path should be selected
• To compute a precise bound, the analysis needs to include as many
“timing accidents” as possible
• Timing accidents: data hazards, branch mispredictions, occupied
functional units, cache misses, etc.
• Issues: timing anomalies and domino effects
• Thus, timing analysis has to follow all possible successor states
• The more performance enhancing features the pipeline has, the larger
the search space
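The growth of the search space can be sketched with a small exhaustive exploration. This is a toy model under stated assumptions: the `explore` and `toy_successors` names, the instruction strings, and the latencies (load hit 2 cycles, load miss 10 cycles, everything else 1 cycle) are hypothetical. The toy cost model is purely additive, so it shows only the number of paths, not anomalies. A real analysis would carry the pipeline and cache state in `state`.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

Successors = Callable[[Hashable, str], Iterable[Tuple[int, Hashable]]]


def explore(state: Hashable, program: List[str], successors: Successors) -> Tuple[int, int]:
    """Follow every successor state; return (worst-case time, number of paths)."""
    if not program:
        return 0, 1
    worst, paths = 0, 0
    for cost, next_state in successors(state, program[0]):
        sub_worst, sub_paths = explore(next_state, program[1:], successors)
        worst = max(worst, cost + sub_worst)
        paths += sub_paths
    return worst, paths


def toy_successors(state: Hashable, instr: str) -> List[Tuple[int, Hashable]]:
    """Assumed model: a load whose cache behaviour is unknown has two successors."""
    if instr == "LD":
        return [(2, state), (10, state)]  # cache hit, cache miss
    return [(1, state)]


print(explore(None, ["LD", "ADD", "LD", "ADD"], toy_successors))  # (22, 4)
print(explore(None, ["LD"] * 10, toy_successors)[1])              # 1024
```

Every instruction with an unresolved timing accident multiplies the number of paths, so ten unclassified loads already yield 1024 paths in this model.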
Timing Anomaly
• Formal definition - a situation where the local worst
case does not contribute to the global worst case
• A better definition – a positive improvement to the
architecture that has a negative effect on execution
time
• Examples:
• A cache miss may result in a shorter execution time
• Shortening an instruction leads to longer execution time
Timing Anomaly Example: Cache Hit or Miss
A: LD r4, 0(r3)
B: ADD r5, r4, r4
C: ADD r1, r6, r6
D: MUL r2, r1, r1
E: MUL r3, r2, r2
• Latencies: Miss Penalty 8 cyc., LSU 2 cyc., ALU 1 cyc., Multiplier 4 cyc.
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units over cycles 1 to 13, one labeled Cache Hit and one labeled Cache Miss]
• Architecture is made up of functional units and
reservation stations – similar to Tomasulo’s Algorithm
Timing Anomaly Example: Reduced Instruction
A: MUL r2, r1, r1
B: ADD r3, r2, r2
C: ADD r4, r5, r5
D: LD r6, 0(r4)
E: ADD r7, r6, r6
• Latencies: Miss Penalty 8 cyc., LSU 4 cyc., ALU 2 cyc., Multiplier ? cyc.
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units over cycles 1 to 13, one labeled Multiplier = 5 cycles and one labeled Multiplier = 2 cycles]
• Architecture is made up of functional units and
reservation stations – similar to Tomasulo’s Algorithm
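Both timing anomaly examples above can be reproduced with a small greedy scheduler. This is a sketch under assumptions that are not stated on the slides: instruction i enters its reservation station in cycle i, each functional unit handles one instruction at a time, a free unit starts the oldest waiting instruction whose operands are complete, and a cache miss adds the miss penalty to the LSU latency. The names `simulate`, `total_cycles`, `CACHE_EXAMPLE`, and `MUL_EXAMPLE` are hypothetical.

```python
from typing import Dict, List, Tuple

# (name, functional unit, names of the instructions whose results it needs)
Instr = Tuple[str, str, Tuple[str, ...]]


def simulate(program: List[Instr], latency: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Greedy out-of-order schedule; returns {name: (start_cycle, end_cycle)}.

    Assumes every latency is at least 1 and every dependency names an
    earlier instruction of the same program.
    """
    schedule: Dict[str, Tuple[int, int]] = {}
    unit_free_at: Dict[str, int] = {}  # unit -> first cycle in which it is free
    cycle = 1
    while len(schedule) < len(program):
        # Program order means the oldest ready instruction wins a free unit.
        for index, (name, unit, deps) in enumerate(program, start=1):
            if name in schedule or index > cycle:
                continue  # already started, or not yet in a reservation station
            ready = all(d in schedule and schedule[d][1] < cycle for d in deps)
            if ready and unit_free_at.get(unit, 1) <= cycle:
                end = cycle + latency[name] - 1
                schedule[name] = (cycle, end)
                unit_free_at[unit] = end + 1
        cycle += 1
    return schedule


def total_cycles(schedule: Dict[str, Tuple[int, int]]) -> int:
    """Cycle in which the last instruction finishes."""
    return max(end for _, end in schedule.values())


# Example 1: LD, ADD, ADD, MUL, MUL (B needs A, D needs C, E needs D).
CACHE_EXAMPLE = [("A", "LSU", ()), ("B", "ALU", ("A",)), ("C", "ALU", ()),
                 ("D", "MULT", ("C",)), ("E", "MULT", ("D",))]
hit = total_cycles(simulate(CACHE_EXAMPLE, {"A": 2, "B": 1, "C": 1, "D": 4, "E": 4}))
miss = total_cycles(simulate(CACHE_EXAMPLE, {"A": 2 + 8, "B": 1, "C": 1, "D": 4, "E": 4}))

# Example 2: MUL, ADD, ADD, LD, ADD (B needs A, D needs C, E needs D).
MUL_EXAMPLE = [("A", "MULT", ()), ("B", "ALU", ("A",)), ("C", "ALU", ()),
               ("D", "LSU", ("C",)), ("E", "ALU", ("D",))]
slow_mul = total_cycles(simulate(MUL_EXAMPLE, {"A": 5, "B": 2, "C": 2, "D": 4, "E": 2}))
fast_mul = total_cycles(simulate(MUL_EXAMPLE, {"A": 2, "B": 2, "C": 2, "D": 4, "E": 2}))

print(hit, miss)           # 12 11
print(slow_mul, fast_mul)  # 10 12
```

Under these assumptions the cache miss finishes one cycle earlier than the hit, and the 2-cycle multiplier finishes two cycles later than the 5-cycle one. In both cases the locally faster event lets B occupy the ALU ahead of C, which delays the longer dependency chain behind C. The totals come from this sketch, not from the slide figures.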
Domino Effects
• Formal definition – a system exhibits a domino effect
if there are two hardware states s, t such that the
difference in execution time may be arbitrarily high
and cannot be bounded by a constant
• A better definition – a minor timing accident can
cause an unbounded increase in execution time
• Examples:
• Timing accident in a loop
• PowerPC 755 pipeline – Schneider
• Pseudo-least-recently used (PLRU) replacement policy – Berg
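The difference between a domino effect and a bounded effect can be sketched with a toy loop model. Everything here is hypothetical: the state names "s" and "t", the per-iteration costs of 9 and 12 cycles, and the function names are chosen for illustration and do not describe a specific processor.

```python
from typing import Callable, Tuple

Step = Callable[[str], Tuple[int, str]]  # state -> (cycles for one iteration, next state)


def run_loop(iterations: int, state: str, step: Step) -> int:
    """Total cycles for `iterations` loop iterations starting in `state`."""
    total = 0
    for _ in range(iterations):
        cost, state = step(state)
        total += cost
    return total


def domino_step(state: str) -> Tuple[int, str]:
    """Each state reproduces itself, so the slower state never recovers."""
    return (9, "s") if state == "s" else (12, "t")


def converging_step(state: str) -> Tuple[int, str]:
    """The slower state costs extra once and then falls back to the faster state."""
    return (9, "s") if state == "s" else (12, "s")


for n in (1, 10, 100):
    domino_gap = run_loop(n, "t", domino_step) - run_loop(n, "s", domino_step)
    bounded_gap = run_loop(n, "t", converging_step) - run_loop(n, "s", converging_step)
    print(n, domino_gap, bounded_gap)
# 1 3 3
# 10 30 3
# 100 300 3
```

With `domino_step` the gap between the two start states grows with the iteration count and has no constant bound. With `converging_step` the gap stays at 3 cycles, the kind of effect that adding a constant to the bound can cover.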
Domino Effects
A: ADD r4, r3, r3
B: ADD r1, r2, r2
[Figure: schedules of the repeating instruction sequence A, B on the ALU over cycles 1 to 20. Legend entries: A Dispatch, A Execute, B Dispatch, B Execute. Figure annotations: EA+5, Immdt, DA+4, DA+6]
• First A gets delayed one clock cycle due to a
dependency with the previous instruction
Classification of Architectures
• Fully Timing Compositional Architectures
• No timing anomalies or domino effects
• Timing analysis can safely follow worst case paths only
• Example: ARM7
• Compositional Architectures with Constant Bounded Effects
• Exhibit timing anomalies but no domino effects
• Timing analysis has to consider all paths but can be optimized to safely discard
all local non-worst case paths by adding a constant number of cycles to the
worst case path – trading precision for efficiency
• Example: Infineon TriCore
• Non Compositional Architectures
• Exhibit timing anomalies and domino effects
• Timing analysis has to follow all possible paths since a local effect can
influence future execution arbitrarily
• Example: PowerPC 755
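The three classes above can be summarized in a short sketch. The enum, the function names, and the `anomaly_constant` parameter are hypothetical and only restate the bullets. The combination "domino effects without timing anomalies" is not covered by the three classes listed, so the sketch rejects it.

```python
from enum import Enum
from typing import List


class ArchClass(Enum):
    FULLY_COMPOSITIONAL = "fully timing compositional"
    CONSTANT_BOUNDED = "compositional with constant-bounded effects"
    NON_COMPOSITIONAL = "non-compositional"


def classify(has_timing_anomalies: bool, has_domino_effects: bool) -> ArchClass:
    """Map the two properties named on the slide to one of its three classes."""
    if has_domino_effects:
        if not has_timing_anomalies:
            raise ValueError("combination not covered by the three classes")
        return ArchClass.NON_COMPOSITIONAL
    if has_timing_anomalies:
        return ArchClass.CONSTANT_BOUNDED
    return ArchClass.FULLY_COMPOSITIONAL


def local_bound(arch: ArchClass, local_times: List[int], anomaly_constant: int = 0) -> int:
    """Bound for one decision point when only the local worst case is followed."""
    if arch is ArchClass.FULLY_COMPOSITIONAL:
        return max(local_times)  # local worst case is safe
    if arch is ArchClass.CONSTANT_BOUNDED:
        return max(local_times) + anomaly_constant  # pay a constant for discarded paths
    raise NotImplementedError("non-compositional: all paths must be followed")


print(classify(False, False).value)  # fully timing compositional
print(local_bound(ArchClass.CONSTANT_BOUNDED, [2, 10], anomaly_constant=3))  # 13
```

The constant added in the second class is the price in precision mentioned above: the bound stays safe without exploring the discarded paths, but it may overestimate.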
Conclusions
• Architectural optimizations in embedded systems are
necessary to improve performance and to meet critical
time constraints
• Pipelines - multiple issue, out of order execution, branch prediction, etc.
• However, an architectural optimization may not be worth
implementing if effects such as timing anomalies and
domino effects will have a negative impact on timing analysis
• How good is an optimization if you can’t measure its effects?
• A trade-off exists between the amount of execution time
you can save by pipeline optimizations and the amount of
precision you lose in timing analysis
Questions?