3.13. Fallacies and Pitfalls

• Fallacy: Processors with lower CPIs will always be faster
• Fallacy: Processors with faster clock rates will always be faster
  – A balance must be found
    • E.g. a sophisticated pipeline: CPI ↓ but clock cycle ↑
• Pitfall: Emphasising improved CPI by increasing the issue rate, while sacrificing clock rate, can decrease performance
  – Again, a question of balance
  – SuperSPARC –vs– HP PA 7100
  – Complex interactions between cycle time and organisation
• Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
  – Amdahl’s Law!
  – Boosting performance in one area may uncover problems in another
• Pitfall: Sometimes bigger and dumber is better!
  – Alpha 21264: sophisticated multilevel tournament branch predictor
  – Alpha 21164: simple two-bit predictor
  – The 21164 performs better for a transaction-processing application!
    • Can handle twice as many local branch predictions

Concluding Remarks

• Lots of open questions!
  – Clock speed –vs– CPI
  – Power issues
  – Exploiting parallelism
    • ILP –vs– explicit

Characteristics of Modern (2001) Processors

• Figure 3.61
  – 3–4-way superscalar
  – 4–22-stage pipelines
  – Branch prediction
  – Register renaming (except UltraSPARC)
  – 400 MHz – 1.7 GHz
  – 7–130 million transistors

Chapter 4. Exploiting ILP with Software

4.1. Compiler Techniques for Exposing ILP

• Compilers can improve the performance of simple pipelines
  – Reduce data hazards
  – Reduce control hazards

Loop Unrolling

• Compiler technique to increase ILP
  – Duplicate the loop body
  – Decrease the number of iterations
• Example:
  – Basic code: 10 cycles per iteration
  – Scheduled: 6 cycles

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

• Unrolled four times:

    for (int k = 0; k < 1000; k += 4) {
        x[k]   = x[k]   + s;
        x[k+1] = x[k+1] + s;
        x[k+2] = x[k+2] + s;
        x[k+3] = x[k+3] + s;
    }

• Unrolled, unscheduled: 7 cycles per “iteration”
• Scheduled: 3.5 cycles (no stalls!)
• Requires clever compilers
  – Analysing data dependences, name dependences and control dependences
• Limitations
  – Code size
  – Decrease in amortisation of overheads
  – “Register pressure”
  – Compiler limitations
• Useful for any architecture

Superscalar Performance

• Two-issue MIPS (int + FP)
• 2.4 cycles per “iteration”
  – Unrolled five times

4.2. Static Branch Prediction

• Useful:
  – where behaviour can be predicted at compile time
  – to assist dynamic prediction
• Architectural support
  – Delayed branches
• Simple: predict taken
  – Average misprediction rate of 34% (SPEC)
  – Range: 9% – 59%
• Better: predict backward taken, forward not-taken
  – Worse for SPEC!
• Advanced compiler analysis can do better
• Profiling is very useful
  – FP: 9% ± 4%
  – Int: 15% ± 5%
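The “backward taken, forward not-taken” heuristic above can be stated very compactly. The following C fragment is only an illustrative sketch (the function and parameter names are not from the original slides): a branch whose target lies at a lower address than the branch itself is assumed to close a loop and is predicted taken; a forward branch is predicted not taken.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative sketch of the static "backward taken, forward
     * not-taken" heuristic: predict taken exactly when the branch
     * target address is below the branch's own address. */
    static bool predict_taken(uint64_t branch_pc, uint64_t target_pc) {
        return target_pc < branch_pc;   /* backward branch => taken */
    }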
4.3. Static Multiple Issue: VLIW

• Compiler groups instructions into “packets”, checking for dependences
  – Remove dependences
  – Flag dependences
• Simplifies hardware
• First machines used a wide instruction with multiple operations per instruction
  – Hence Very Long Instruction Word (VLIW)
  – 64–128 bits
• Alternative: group several instructions into an issue packet

VLIW Architectures

• Multiple functional units
• Compiler selects instructions for each unit to create one long instruction / an issue packet
• Example: five operations
  – Integer/branch, 2 × FP, 2 × memory access
• Need lots of parallelism
  – Use loop unrolling, or global scheduling

Example

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

• Loop unrolled seven times!
• 1.29 cycles per result
• 60% of available instruction “slots” filled

Summary of Improvements (cycles per “iteration”)

    Technique            Unscheduled   Scheduled
    Basic code           10            6
    Loop unrolled (4)    7             3.5
    Superscalar (5)      –             2.4
    VLIW (7)             –             1.29

Drawbacks of Original VLIWs

• Large code size
  – Need to use loop unrolling
  – Wasted space for unused slots
    • Clever encoding techniques, compression
• Lock-step execution
  – Stalling one unit stalls them all
• Binary code compatibility
  – Variations on structure required recompilation

4.4. Compiler Support for Exploiting ILP

• We will not cover this section in detail
• Loop unrolling
  – Loop-carried dependences
• Software pipelining
  – Interleave instructions from different iterations (see the sketch at the end of this section)

4.5. Hardware Support for Extracting More Parallelism

• Techniques like loop unrolling work well when branch behaviour can be predicted at compile time
• If not, we need more advanced techniques:
  – Conditional instructions
  – Hardware support for compiler speculation

Conditional or Predicated Instructions

• Instructions have associated conditions
  – If the condition is true, execution proceeds normally
  – If not, the instruction becomes a no-op
• Example: if (a == 0) b = c;

    With a branch:
            bnez  %r8, L1
            nop
            mov   %r1, %r2
    L1:     ...

    With a conditional move:
            cmovz %r8, %r1, %r2

• Removes control hazards
• Control hazards are effectively replaced by data hazards
• Can be used for speculation
  – Compiler reorders instructions depending on the likely outcome of branches

Limitations on Conditional Instructions

• Annulled instructions still take execution slots
  – But may occupy otherwise-stalled time
• Most useful when conditions can be evaluated early
• Limited usefulness for complex conditions
• May be slower than unconditional operations

Conditional Instructions in Practice

    Machine               Conditional instructions
    MIPS, Alpha, SPARC    Conditional move
    HP PA                 Any register-register instruction can annul the following instruction
    IA-64                 Full predication
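At the source level, the compiler transformation behind conditional moves is often called if-conversion. The sketch below (names and types are illustrative, not from the slides) shows the branchy form of “if (a == 0) b = c;” next to a branch-free form that a compiler can typically lower to a conditional move or a predicated instruction, trading the control hazard for a data hazard as described above.

    #include <stdint.h>

    /* Branchy version: compiles to a compare-and-branch. */
    static int64_t with_branch(int64_t a, int64_t b, int64_t c) {
        if (a == 0)
            b = c;
        return b;
    }

    /* If-converted version: a select between b and c based on a,
     * typically lowered to a conditional move (e.g. cmovz). */
    static int64_t if_converted(int64_t a, int64_t b, int64_t c) {
        return (a == 0) ? c : b;
    }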
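Finally, the software-pipelining idea mentioned in Section 4.4 can be sketched at the C level. This is only an assumed illustration (the function saxpy_swp and its arguments are invented for the example): each “pipelined” iteration of the steady-state loop combines the store of iteration k, the add of iteration k+1 and the load of iteration k+2, so that independent operations drawn from different original iterations can be scheduled together.

    /* Software-pipelined form of: for (k = 0; k < n; k++) x[k] += s; */
    void saxpy_swp(double *x, double s, int n) {
        if (n < 3) {                      /* too short to pipeline */
            for (int k = 0; k < n; k++)
                x[k] += s;
            return;
        }

        double loaded = x[0];             /* prologue: load, iteration 0 */
        double added  = loaded + s;       /* prologue: add, iteration 0  */
        loaded = x[1];                    /* prologue: load, iteration 1 */

        for (int k = 0; k < n - 2; k++) {
            x[k]   = added;               /* store for iteration k       */
            added  = loaded + s;          /* add for iteration k + 1     */
            loaded = x[k + 2];            /* load for iteration k + 2    */
        }

        x[n - 2] = added;                 /* epilogue: drain the pipeline */
        x[n - 1] = loaded + s;
    }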