3.13. Fallacies and Pitfalls
• Fallacy: Processors with lower CPIs will
always be faster
• Fallacy: Processors with faster clock rates
will always be faster
– Balance must be found:
• E.g. sophisticated pipeline: CPI ↓ clock cycle ↑
Fallacies and Pitfalls
• Pitfall: Emphasizing CPI improvement by
increasing the issue rate while sacrificing
clock rate can decrease performance
– Again, question of balance
• SuperSPARC –vs– HP PA 7100
– Complex interactions between cycle time and
organisation
Fallacies and Pitfalls
• Pitfall: Improving only one aspect of a
multiple-issue processor and expecting
overall performance improvement
– Amdahl’s Law!
– Boosting performance of one area may uncover
problems in another
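Amdahl's Law makes the limit on one-sided improvements concrete; a minimal sketch (the function name `amdahl` and the parameters `f` and `s` are illustrative, not from the slides):

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of execution time
 * is accelerated by a factor s. Even a large s gives limited benefit
 * when f is well below 1. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, speeding up 40% of execution time by 10x yields only about a 1.56x overall speedup; the untouched 60% dominates.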
Fallacies and Pitfalls
• Pitfall: Sometimes bigger and dumber is
better!
– Alpha 21264: sophisticated multilevel
tournament branch predictor
– Alpha 21164: simple two-bit predictor
– 21164 performs better for transaction-processing
applications!
• Can handle twice as many local branch predictions
Concluding Remarks
• Lots of open questions!
– Clock speed –vs– CPI
– Power issues
– Exploiting parallelism
• ILP –vs– explicit
Characteristics of Modern (2001)
Processors
• Figure 3.61
– 3–4 way superscalar
– 4–22 stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400MHz – 1.7GHz
– 7–130 million transistors
Chapter 4
Exploiting ILP with Software
4.1. Compiler Techniques for
Exposing ILP
• Compilers can improve the performance of
simple pipelines
– Reduce data hazards
– Reduce control hazards
Loop Unrolling
• Compiler technique to increase ILP
– Duplicate loop body
– Decrease iterations
• Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles
for (int k = 0; k < 1000; k++)
{
    x[k] = x[k] + s;
}
Loop Unrolling
for (int k = 0; k < 1000; k += 4)
{
    x[k]     = x[k]     + s;
    x[k + 1] = x[k + 1] + s;
    x[k + 2] = x[k + 2] + s;
    x[k + 3] = x[k + 3] + s;
}
• Basic code: 7 cycles per “iteration”
• Scheduled: 3.5 cycles (no stalls!)
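The unroll-by-four above relies on the trip count (1000) dividing evenly by 4. A hedged sketch of how a compiler-style unrolled loop handles an arbitrary trip count `n`, using a cleanup loop (the function name `add_scalar` is hypothetical):

```c
#include <assert.h>

/* Unroll-by-4 with a cleanup loop for trip counts not divisible by 4;
 * mirrors the slides' x[k] = x[k] + s example. */
static void add_scalar(double *x, int n, double s)
{
    int k = 0;
    for (; k + 3 < n; k += 4) {   /* main unrolled body */
        x[k]     = x[k]     + s;
        x[k + 1] = x[k + 1] + s;
        x[k + 2] = x[k + 2] + s;
        x[k + 3] = x[k + 3] + s;
    }
    for (; k < n; k++)            /* cleanup for the remaining 0-3 items */
        x[k] = x[k] + s;
}
```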
Loop Unrolling
• Requires clever compilers
– Analysing data dependences, name
dependences and control dependences
• Limitations
– Code size
– Decrease in amortisation of overheads
– “Register pressure”
– Compiler limitations
• Useful for any architecture
Superscalar Performance
• Two-issue MIPS (int + FP)
• 2.4 cycles per “iteration”
– Unrolled five times
4.2. Static Branch Prediction
• Useful:
– where behaviour can be predicted at compile time
– to assist dynamic prediction
• Architectural support
– Delayed branches
Static Branch Prediction
• Simple:
– Predict taken
– Has average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
• Better:
– Predict backward taken, forward not-taken
– Worse for SPEC!
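The backward-taken/forward-not-taken heuristic can be sketched as a predicate on the branch target's address relative to the branch itself (the function name and address types are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Predict taken iff the branch jumps backward (target at or before the
 * branch PC), on the assumption that backward branches close loops and
 * loops usually iterate. */
static int predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc <= branch_pc;
}
```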
Static Branch Prediction
• Advanced compiler analysis can do better
• Profiling is very useful
– FP: 9% ± 4%
– Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW
• Compiler groups instructions into
“packets”, checking for dependences
– Remove dependences
– Flag dependences
• Simplifies hardware
VLIW
• First machines used a wide instruction with
multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
• Alternative: group several instructions into
an issue packet
VLIW Architectures
• Multiple functional units
• Compiler selects instructions for each unit
to create one long instruction (an issue
packet)
• Example: five operations
– Integer/branch, 2 × FP, 2 × memory access
• Need lots of parallelism
– Use loop unrolling, or global scheduling
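The five-operation packet above can be modeled as a fixed-format record with one slot per functional unit (field names are illustrative; a real VLIW encodes these slots into a 64–128 bit word, with empty slots holding no-ops):

```c
#include <assert.h>
#include <stdint.h>

/* One VLIW issue packet for the five-unit example:
 * an unused slot is filled with a no-op encoding. */
typedef struct {
    uint32_t int_or_branch;  /* integer ALU / branch unit */
    uint32_t fp[2];          /* two floating-point units */
    uint32_t mem[2];         /* two memory-access units */
} vliw_packet;
```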
Example
for (int k = 0; k < 1000; k++)
{
    x[k] = x[k] + s;
}
• Loop unrolled seven times!
• 1.29 cycles per result
• 60% of available instruction “slots” filled
Summary of Improvements
Technique            Unscheduled   Scheduled
Basic code               10           6
Loop unrolled (4)         7           3.5
Superscalar (5)           –           2.4
VLIW (7)                  –           1.29
(cycles per “iteration”)
Drawbacks of Original VLIWs
• Large code size
– Need to use loop unrolling
– Wasted space for unused slots
• Clever encoding techniques, compression
• Lock-step execution
– Stalling one unit stalls them all
• Binary code compatibility
– Variations on structure required recompilation
4.4. Compiler Support for
Exploiting ILP
• We will not cover this section in detail
• Loop unrolling
– Loop-carried dependences
• Software pipelining
– Interleave instructions from different iterations
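A loop-carried dependence, the main obstacle to unrolling mentioned above, looks like this (a hypothetical variant of the slides' running example, with the function name `prefix_add` made up for illustration):

```c
#include <assert.h>

/* Loop-carried dependence: iteration k reads x[k-1], which was written
 * by iteration k-1, so iterations cannot be overlapped or reordered
 * freely the way the independent x[k] = x[k] + s loop can. */
static void prefix_add(double *x, int n, double s)
{
    for (int k = 1; k < n; k++)
        x[k] = x[k - 1] + s;
}
```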
4.5. Hardware Support for
Extracting More Parallelism
• Techniques like loop-unrolling work well
when branch behaviour can be predicted at
compile time
• If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation
Conditional or Predicated Instructions
• Instructions have associated conditions
– If condition is true execution proceeds normally
– If not, instruction becomes a no-op
if (a == 0)
    b = c;

Branch version:          Conditional version:
    bnez  %r8, L1            cmovz %r8, %r1, %r2
    nop
    mov   %r1, %r2
L1: ...

• Removes control hazards
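In C source terms, the conditional move turns the branch into a branch-free select; a minimal sketch (the function name is illustrative):

```c
#include <assert.h>

/* Branch-free equivalent of "if (a == 0) b = c;": the select always
 * executes, but the old value of b survives when the predicate fails,
 * mirroring a cmovz-style conditional move. */
static int select_if_zero(int a, int b, int c)
{
    return (a == 0) ? c : b;   /* compilers often emit a conditional move here */
}
```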
Conditional Instructions
• Control hazards effectively replaced by data
hazards
• Can be used for speculation
– Compiler reorders instructions depending on
likely outcome of branches
Limitations on Conditional
Instructions
• Annulled instructions still execute
– But may occupy otherwise stalled time
• Most useful when conditions evaluated
early
• Limited usefulness for complex conditions
• May be slower than unconditional
operations
Conditional Instructions in
Practice
Machine               Conditional Instructions
MIPS, Alpha, SPARC    Conditional move
HP PA                 Any register-register instruction can
                      annul the following instruction
IA-64                 Full predication