Pipelines for Future Architectures in Time Critical Embedded Systems
By: R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand
EEL 6935 - Embedded Systems
Dept. of Electrical and Computer Engineering, University of Florida
Liza Rodriguez, Aurelio Morales

Outline
• Pipelining Review
• Timing Analysis
  • Anomalies
  • Domino Effects
• Architecture Classifications
• Conclusions

Pipelining Review
• Pipelining is an implementation technique where multiple instructions are overlapped in execution
• Pipelining takes advantage of parallelism that exists among the actions needed to execute an instruction
• Pipelining is like an assembly line: each stage operates in parallel with the other stages
• Instructions enter at one end, progress through the stages, and exit at the other end
• Pipelining is the key implementation technique used to make fast CPUs

Pipelined Example
• Pipeline registers separate functional units to allow parallel operation
• Pipeline will stall if there is a hazard
• Stages: Fetch, Decode, Execute, Memory, Write Back

Instruction          Cycles
LD  r4, 0(r3)        5 cycles (5)
ADD r1, r7, r3       1 cycle (4)
ADD r2, r6, r30      1 cycle (4)

[Figure: five-stage pipeline datapath with the three instructions in flight]

Further Optimizations
• Superscalar – executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to redundant functional units
  [Figure: two parallel pipelines, each with Fetch, Decode, Execute, Memory, Write Back]
• Branch Prediction – predicts branches based on a predefined static algorithm or on dynamic branch history
• Out of order execution – instructions are dynamically scheduled to avoid hazards and dependencies that may stall the pipeline
  [Figure: reservation stations feeding the functional units (Execute, Memory)]

Reservation station entry    Status
ADD r1, r2, r3               wait
LD  r4, 0(r5)                wait
SUB r1, r2, r3               wait
ST  r2, 0(r1)                ready
MUL r6, r7, r8               ready
LD  r4, 0(r1)                wait

Real Time Embedded Systems
• Timing Analysis
  • The analysis of a set of tasks executing on given hardware to guarantee that timing constraints will be met
  • Requires known upper and lower bounds on the execution times of tasks: Worst Case Execution Time (WCET) and Best Case Execution Time (BCET)
  • Analysis results are highly dependent on the architecture
  • An architecture without accompanying performance analysis technology should not be seriously considered for time critical embedded applications
• Desired Criteria
  • Soundness – the results are valid and reliable
  • Obtainable Precision – depends on the predictability properties of the architecture
  • Analysis effort to reach precision – depends on the solution space to be explored

Timing Analysis
• Non-Pipelined Architecture – Simple
  • Add the execution times of individual instructions to obtain a bound on the execution time of a basic block
• Pipelined Architecture – Complex
  • Instructions overlap, so individual instructions cannot be considered in isolation
  • Instructions must be considered collectively to obtain timing bounds

Timing Analysis
• Pipelined Architecture – Complex
  • To do WCET analysis, the most costly pipeline path should be selected
  • To compute a precise bound, the analysis needs to include as many “timing accidents” as possible
  • Timing accidents: data hazards, branch mispredictions, occupied functional units, cache misses, etc.
  • Issues: timing anomalies and domino effects
  • Thus, timing analysis has to follow all possible successor states
  • The more performance enhancing features the pipeline has, the larger the search space

Timing Anomaly
• Formal definition – a situation where the local worst case does not contribute to the global worst case
• A better definition – a seemingly positive change (such as a cache hit or a faster instruction) that has a negative effect on overall execution time
• Examples:
  • A cache miss may result in a shorter execution time
  • Shortening an instruction may lead to a longer execution time

Timing Anomaly Example: Cache Hit or Miss

Instruction            Unit
A  LD  r4, 0(r3)       LSU
B  ADD r5, r4, r4      ALU
C  ADD r1, r6, r6      ALU
D  MUL r2, r1, r1      MULT
E  MUL r3, r2, r2      MULT

Latencies: Miss Penalty 8 cycles, LSU 2 cycles, ALU 1 cycle, Multiplier 4 cycles

[Figure: two schedules over cycles 1-13, one for Cache Hit and one for Cache Miss, showing A on the LSU, B and C on the ALU, and D and E on the multiplier]

• Architecture is made up of functional units and reservation stations – similar to Tomasulo’s Algorithm

Timing Anomaly Example: Reduced Instruction

Instruction            Unit
A  MUL r2, r1, r1      MULT
B  ADD r3, r2, r2      ALU
C  ADD r4, r5, r5      ALU
D  LD  r6, 0(r4)       LSU
E  ADD r7, r6, r6      ALU

Latencies: Miss Penalty 8 cycles, LSU 4 cycles, ALU 2 cycles, Multiplier ? cycles

[Figure: two schedules over cycles 1-13, one with Multiplier = 5 cycles and one with Multiplier = 2 cycles]

• Architecture is made up of functional units and reservation stations – similar to Tomasulo’s Algorithm

Domino Effects
• Formal definition – a system exhibits a domino effect if there are two hardware states s, t such that the difference in execution time of the same program started in s and in t may be arbitrarily high and cannot be bounded by a constant
• A better definition – a minor timing accident can cause an unbounded increase in execution time
• Examples:
  • Timing accident in a loop
  • PowerPC 755 pipeline – Schneider
  • Pseudo-least-recently used (PLRU) replacement policy – Berg

Domino Effects

A  ADD r4, r3, r3
B  ADD r1, r2, r2

[Figure: two schedules over cycles 1-20 showing the dispatch and execute phases of repeated A and B instructions on two ALUs]

• First A gets delayed one clock cycle due to a dependency with the previous instruction

Classification of Architectures
• Fully Timing Compositional Architectures
  • No timing anomalies or domino effects
  • Timing analysis can safely follow worst case paths only
  • Example: ARM7
• Compositional Architectures with Constant Bounded Effects
  • Exhibit timing anomalies but no domino effects
  • Timing analysis has to consider all paths, but can be optimized to safely discard all local non-worst case paths by adding a constant number of cycles to the worst case path – trading precision for efficiency
  • Example: Infineon TriCore
• Non Compositional Architectures
  • Exhibit timing anomalies and domino effects
  • Timing analysis has to follow all possible paths since a local effect can influence future execution arbitrarily
  • Example: PowerPC 755

Conclusions
• Architectural optimizations in embedded systems are necessary to improve performance and to meet critical time constraints
  • Pipelines - multiple issue, out of order execution, branch prediction, etc.
• However, an architectural optimization may not be worth implementing if effects such as timing anomalies and domino effects will have a negative impact on timing analysis
  • How good is an optimization if you can’t measure its effects?
• A trade off exists between the amount of execution time you can save by pipeline optimizations and the amount of precision you lose in timing analysis

Questions?
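
The cache hit-or-miss anomaly from the Timing Anomaly Example can be reproduced with a small scheduling model. The following is a minimal sketch, not the authors' analysis and not a model of any real processor. It uses the latencies stated on that slide (LSU 2 cycles, miss penalty 8 cycles, ALU 1 cycle, multiplier 4 cycles). Everything else is an assumption made for illustration: one instruction is dispatched per cycle in program order, each functional unit runs one instruction at a time, and a free unit always picks the oldest ready instruction. The names `Instr`, `program`, and `simulate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str          # label used on the slide (A..E)
    unit: str          # functional unit that executes the instruction
    latency: int       # cycles the instruction occupies its unit
    deps: tuple = ()   # names of instructions whose results it needs


def program(load_latency):
    """Instruction sequence A..E from the cache hit-or-miss slide."""
    return [
        Instr("A", "LSU", load_latency),   # LD  r4, 0(r3)
        Instr("B", "ALU", 1, ("A",)),      # ADD r5, r4, r4
        Instr("C", "ALU", 1),              # ADD r1, r6, r6
        Instr("D", "MULT", 4, ("C",)),     # MUL r2, r1, r1
        Instr("E", "MULT", 4, ("D",)),     # MUL r3, r2, r2
    ]


def simulate(instrs):
    """Return the cycle in which the last instruction finishes.

    Assumed model: instruction i enters a reservation station in cycle i,
    and each unit starts the oldest instruction that is dispatched, has all
    operands available, and finds its unit free.
    """
    finish = {}      # instruction name -> cycle its result is available
    unit_free = {}   # unit name -> first cycle the unit is free again
    pending = list(enumerate(instrs))  # (dispatch cycle, instruction)
    cycle = 0
    while pending:
        # Iterating in program order gives oldest-first priority.
        for entry in list(pending):
            dispatched_at, ins = entry
            ready = (
                dispatched_at <= cycle
                and all(finish.get(d, float("inf")) <= cycle for d in ins.deps)
                and unit_free.get(ins.unit, 0) <= cycle
            )
            if ready:
                finish[ins.name] = cycle + ins.latency
                unit_free[ins.unit] = cycle + ins.latency
                pending.remove(entry)
        cycle += 1
    return max(finish.values())


if __name__ == "__main__":
    print("cache hit :", simulate(program(2)))       # LSU latency only
    print("cache miss:", simulate(program(2 + 8)))   # LSU latency + miss penalty
```

Under these assumptions the cache hit case finishes in 12 cycles and the cache miss case in 11. With a hit, B becomes ready early and takes the ALU ahead of C, which delays the long C, D, E chain on the multiplier. With a miss, C runs first and the multiplier chain starts one cycle sooner. The local worst case (the miss) is therefore not part of the global worst case, which is why an analysis cannot simply follow the locally worst path on such an architecture.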