Chapter 2:
ILP and Its Exploitation

• Review simple static pipeline
• ILP overview
• Dynamic branch prediction
• Dynamic scheduling, out-of-order execution
• Multiple issue (superscalar)
• Hardware-based speculation
• ILP limitations
• Intel P6 microarchitecture
Advanced Processor Pipelining
• Focus on exploiting Instruction-Level Parallelism (ILP):
  – Definition: executing multiple instructions (within a single program thread) simultaneously
  – Note that even ordinary pipelining exploits some ILP (overlapping execution of multiple instructions).
  – Focus of this chapter: increasing ILP even more by allowing out-of-order execution, with or without speculation
  – Using multiple-issue datapaths to initiate multiple instructions simultaneously for further improvement
    • Microarchitectures that do this are called superscalar
    • Examples: PowerPC, Pentium, etc.
Pipeline Performance
• Ideal pipeline CPI = 1 is the minimum number of cycles per instruction issued, if no stalls occur.
  – May be <1 in superscalar machines.
    • E.g., ideal CPI = 1/3 in a 3-way superscalar (often stated as IPC = 3)
• Real pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control hazard stalls
• Maximize performance by using various techniques to eliminate stalls and reduce the ideal CPI.
• Note: real pipeline CPI still needs to account for cache misses (discussed later).
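• Worked example (illustrative stall counts, not measured data):

    Real CPI = 1.0 (ideal) + 0.05 (structural) + 0.20 (data) + 0.12 (control)
             = 1.37

  Removing all stalls would give a speedup of 1.37/1.0 = 1.37x, before
  cache effects are counted.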
Advanced Pipelining Techniques

  Technique                                   Reduces
  ------------------------------------------------------------------
  Loop unrolling                              Control stalls
  Basic pipeline scheduling / forwarding      RAW stalls
  Dynamic scheduling w. scoreboarding         RAW stalls
  Dynamic scheduling with register renaming   WAR & WAW stalls
  Dynamic branch prediction                   Control stalls
  Issuing multiple instructions per cycle     Ideal pipeline CPI
  Compiler dependence analysis                Ideal CPI & data stalls
  Software pipelining & trace scheduling      Ideal CPI & data stalls
  Hardware speculation                        All data & control stalls
  Dynamic memory disambiguation               RAW stalls involving mem.
Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
  – BB: a straight-line code sequence with no branches in or out, except at the entry and at the exit
  – Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between two branches
  – Plus, instructions in a BB are likely to depend on each other
• To obtain performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, i.e., exploit parallelism among iterations of a loop. E.g.,
    for (i=1; i<=1000; i=i+1)
      x[i] = x[i] + y[i];
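• Because the iterations are independent, the same computation can be unrolled at the source level (a C-level preview of loop unrolling, covered shortly; 1000 is a multiple of 4, so no cleanup loop is needed):

    for (i=1; i<=1000; i=i+4) {   /* 4 independent iterations per pass */
      x[i]   = x[i]   + y[i];
      x[i+1] = x[i+1] + y[i+1];
      x[i+2] = x[i+2] + y[i+2];
      x[i+3] = x[i+3] + y[i+3];
    }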
Dependences
• A dependence is a way in which one instruction can depend on (be affected by) another for scheduling purposes.
• Three major dependence types:
  – Data (true) dependence: RAW
  – Name dependence: WAR, WAW
  – Control dependence: branch, jump, etc.
• A dependency (or dependence) is a particular instance of one instruction depending on another.
  – The instructions can't be effectively (as opposed to just syntactically) fully parallelized or reordered.
Data Dependence
• Recursive definition:
  – Instruction B is data dependent on instruction A iff:
    • B uses a data result produced by instruction A, or
    • There is another instruction C such that B is data dependent on C, and C is data dependent on A.
• Potential for a RAW hazard
• Loop example:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4
          SUBI  R1,R1,#8
          BNEZ  R1,Loop

  Data dependences in the loop: LD -> ADDD (through F0),
  ADDD -> SD (through F4), and SUBI -> BNEZ (through R1).
Name Dependence
• Occurs when two instructions both access the same data storage location due to reuse of that storage (also called storage dependence; at least one of the accesses must be a write.)
• Two sub-types (for instruction B after instruction A):
  – Antidependence: A reads, then B writes.
    • Potential for a WAR hazard.
  – Output dependence: A writes, then B writes.
    • Potential for a WAW hazard.
• Note: name dependences can be avoided by changing instructions to use different locations (rather than reusing a location), as in the sketch below.
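• A C-level sketch of how renaming removes name dependences (variable names are illustrative):

    t = a + b;      /* A: writes t                                */
    u = t * c;      /* reads t (true dependence on A)             */
    t = d - e;      /* B: writes t again: output dependence on A, */
                    /* antidependence with the read above         */
    v = t + f;

    /* renamed: B and its consumer get a fresh location t2, */
    /* leaving only the true dependences                    */
    t2 = d - e;
    v  = t2 + f;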
WAR, WAW Examples
• WAR hazard: instruction J writes an operand before (earlier) instruction I reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• WAW hazard: instruction J writes an operand before instruction I writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Control Dependence
• Occurs when the execution of an instruction depends on a conditional branch instruction.
• Correct execution must respect the program's control flow.
• However, only two things must really be preserved:
  – Data flow (how a given result is produced)
  – Exception behavior (exceptions must be handled in order)
• Example (for exception behavior):

          DADDU  R2,R3,R4
          BEQZ   R2,L1
          LW     R1,0(R2)   ; may not move before BEQZ
    L1:   ...               ; due to exception & R1
Control Dependence – Another Example
• Example (for data flow):

          DADDU  R2,R3,R4
          BEQZ   R5,L1
          DSUBU  R2,R6,R7
    L1:   OR     R8,R2,R9

• OR depends on both DADDU and DSUBU; maintaining the data dependences alone is not enough
• Control flow decides where the correct R2 comes from (DADDU or DSUBU)
Relaxing Control Dependence
• Only two things must really be preserved:
  – Data flow (how a given result is produced)
  – Exception behavior
• Some techniques permit removing the control dependence from instruction execution, by conditionally ignoring instruction results instead:
  – Speculation (betting on branches, e.g., to fill delay slots)
    • Make instructions unconditional if no harm is done
  – Speculative multiple-path execution
    • Take both paths, invalidate the results of one later
  – Conditional / predicated instructions (used in IA-64); see the if-conversion sketch below
• Note: instructions reordered around a branch must have no side effects
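• A C-level sketch of if-conversion, the idea behind predicated execution (illustrative; real predication is done at the instruction level):

    /* branch form: the assignment is control dependent on p */
    if (p)
        x = a + b;

    /* if-converted form: the add runs unconditionally and the   */
    /* result is kept or discarded; safe only if a+b can't fault */
    t = a + b;
    x = p ? t : x;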
Loop Unrolling
• This code adds a scalar to a vector:
    for (i=1000; i>0; i=i-1)
      x[i] = x[i] + s;
• Assume the following latencies for all examples
  – Ignore the delayed branch in these examples

  Instruction         Instruction         Latency    Stalls between,
  producing result    using result        in cycles  in cycles
  -----------------------------------------------------------------
  FP ALU op           Another FP ALU op   4          3
  FP ALU op           Store double        3          2
  Load double         FP ALU op           1          1
  Load double         Store double        1          0
  Integer op          Integer op          1          0
MIPS Code
• First translate into MIPS code:
  – To simplify, assume 8 is the lowest address

    Loop: L.D     F0,0(R1)    ;F0=vector element
          ADD.D   F4,F0,F2    ;add scalar from F2
          S.D     0(R1),F4    ;store result
          DADDUI  R1,R1,-8    ;decrement pointer 8B (DW)
          BNEZ    R1,Loop     ;branch R1!=zero
Execution Cycles without Inst Scheduling

    1  Loop: L.D     F0,0(R1)   ;F0=vector element
    2         stall
    3         ADD.D   F4,F0,F2   ;add scalar in F2
    4         stall
    5         stall
    6         S.D     0(R1),F4   ;store result
    7         DADDUI  R1,R1,-8   ;decrement pointer 8B (DW)
    8         stall              ;assumes can't forward to branch
    9         BNEZ    R1,Loop    ;branch R1!=zero

  Instruction         Instruction         Stalls in
  producing result    using result        clock cycles
  ----------------------------------------------------
  FP ALU op           Another FP ALU op   3
  FP ALU op           Store double        2
  Load double         FP ALU op           1

• 9 clock cycles: rewrite the code to minimize stalls?
Apply Instruction Scheduling

    1  Loop: L.D     F0,0(R1)
    2         DADDUI  R1,R1,-8
    3         ADD.D   F4,F0,F2
    4         stall
    5         stall
    6         S.D     8(R1),F4   ;altered offset
    7         BNEZ    R1,Loop

• 7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; can we make it faster?
Unroll Loop Four Times

     1  Loop: L.D     F0,0(R1)
     3        ADD.D   F4,F0,F2
     6        S.D     0(R1),F4      ;no DSUBUI/BNEZ
     7        L.D     F6,-8(R1)
     9        ADD.D   F8,F6,F2
    12        S.D     -8(R1),F8     ;drop loop overhead
    13        L.D     F10,-16(R1)
    15        ADD.D   F12,F10,F2
    18        S.D     -16(R1),F12   ;drop loop overhead
    19        L.D     F14,-24(R1)
    21        ADD.D   F16,F14,F2
    24        S.D     -24(R1),F16
    25        DADDUI  R1,R1,#-32    ;alter to 4*8
    27        BNEZ    R1,LOOP

• 27 clock cycles, or 6.75 per iteration
  (assumes the iteration count is a multiple of 4)
Unrolled and Rescheduled Loop

     1  Loop: L.D     F0,0(R1)
     2        L.D     F6,-8(R1)
     3        L.D     F10,-16(R1)
     4        L.D     F14,-24(R1)
     5        ADD.D   F4,F0,F2
     6        ADD.D   F8,F6,F2
     7        ADD.D   F12,F10,F2
     8        ADD.D   F16,F14,F2
     9        S.D     0(R1),F4
    10        S.D     -8(R1),F8
    11        S.D     -16(R1),F12
    12        DSUBUI  R1,R1,#32
    13        S.D     8(R1),F16    ; 8-32 = -24
    14        BNEZ    R1,LOOP

• 14 clock cycles, or 3.5 per iteration
Loop Unrolling Decisions
• Requires understanding how one instruction depends on another, and how the instructions can be changed or reordered given the dependences:
  1. Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop maintenance code)
  2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
  3. Eliminate the extra test and branch instructions, and adjust the loop termination and iteration code
  4. Determine that the loads and stores in the unrolled loop can be interchanged, by observing that loads and stores from different iterations are independent
     – This transformation requires analyzing memory addresses and finding that they do not refer to the same address
  5. Schedule the code, preserving any dependences needed to yield the same result as the original code
Three Unrolling Considerations
• Decrease in the amount of overhead amortized with each extra unrolling
  – Amdahl's Law
• Growth in code size
  – For larger loops, the concern is that it increases the instruction cache miss rate
• Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
  – If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling
=> Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
Static Branch Prediction
• Delayed branch
• To reorder code around branches, we need to predict the branch statically at compile time
  – Predict every branch as taken:
    average misprediction = untaken frequency = 34% for SPEC
• More accurate prediction:
  – Profile-based
Dynamic Branch Prediction
• Why does prediction work?
  – The underlying algorithm has regularities
  – The data being operated on has regularities
  – The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems
• Is dynamic branch prediction better than static branch prediction?
  – Seems to be
  – A small number of important branches in programs have dynamic behavior
Dynamic Branch Prediction
• As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases.
  – Branches come more often
  – An n-cycle delay postpones more instructions
• Dynamic hardware branch prediction
  – "Learns" which branches are taken, or not
  – Makes the right guess (most of the time) about whether a branch is taken, or not
  – The delay depends on whether the prediction is correct, and whether the branch is taken.
Branch-Prediction Buffers (BPB)
• Also called a "branch history table"
• The low-order n bits of the branch address index a table of branch history data used for prediction.
  – May have "collisions" between distant branches.
  – Associative tables are also possible
• Each entry stores k bits of information about the history of that branch.
  – Common values of k: 1, 2, and larger
• The entry is used to predict what the branch will do.
• The actual behavior of the branch updates the entry.
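• A minimal sketch of the indexing scheme in C (table size and names are illustrative):

    #define BHT_BITS 12                    /* 4096 entries            */
    #define BHT_SIZE (1 << BHT_BITS)

    unsigned char bht[BHT_SIZE];           /* k bits of history each  */

    /* low-order bits of the (word-aligned) branch address */
    unsigned bht_index(unsigned pc) {
        return (pc >> 2) & (BHT_SIZE - 1); /* distant branches with the
                                              same low bits collide   */
    }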
1-bit Branch Prediction
• The entry for a branch has only two states:
  – Bit = 1
    • "The last time this branch was encountered, it was taken. I predict it will be taken next time."
  – Bit = 0
    • "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
• Will make 2 mistakes each time a loop is encountered (see the sketch below).
  – At the end of the first & last iterations.
• May always mispredict in pathological cases!
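• A minimal 1-bit predictor sketch, reusing the illustrative bht[] array above:

    int predict_1bit(unsigned pc) {
        return bht[bht_index(pc)];         /* 1 = predict taken     */
    }

    void update_1bit(unsigned pc, int taken) {
        bht[bht_index(pc)] = taken;        /* remember last outcome */
    }

    /* For a loop visited repeatedly, this mispredicts twice per   */
    /* visit: once on loop exit, and once on the first iteration   */
    /* of the next visit (the entry still says "not taken").       */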
2-bit Branch Prediction
• 4 states
  – Based on the most recent two branch history actions
• Only 1 misprediction per loop execution, after the first time the loop is reached.
  – The last iteration
• What about an n-bit predictor?

  State   Most recent   Most recent   Predict
          run of 2      action
  ----------------------------------------------
  0       Not Taken     Not Taken     Not Taken
  1       Not Taken     Taken         Not Taken
  2       Taken         Not Taken     Taken
  3       Taken         Taken         Taken
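• The usual hardware realization is a saturating 2-bit counter per entry; a C sketch (the 0..3 encoding is an assumption consistent with the table above):

    /* 0,1 = predict not taken; 2,3 = predict taken */
    int predict_2bit(unsigned pc) {
        return bht[bht_index(pc)] >= 2;
    }

    void update_2bit(unsigned pc, int taken) {
        unsigned char *c = &bht[bht_index(pc)];
        if (taken)  { if (*c < 3) (*c)++; }   /* saturate at 3 */
        else        { if (*c > 0) (*c)--; }   /* saturate at 0 */
    }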
State Transition Diagram
(figure: state transition diagram of the 2-bit predictor)
Implementing Branch Histories
• A separate "cache" (prediction table) accessed during the IF stage
• Or extra bits in the instruction cache
• Problem with this approach in MIPS:
  – After fetch, we don't know whether the instruction is really a branch or not (until decoding)
  – We also don't know the target address.
  – In MIPS, by the time you know these things (in ID), you already know whether the branch is really taken!
  – So we haven't saved any time!
  – Branch-target buffers can fix this problem (later)...
Misprediction Rate for 2-bit BPB
(figure: misprediction rates of a 2-bit branch-prediction buffer, by benchmark)
Branch-Prediction Performance
• The contribution to cycle count depends on:
  – Branch frequency & misprediction frequency
  – Frequencies of taken/not taken, predicted/mispredicted.
  – Delays of taken/not taken, predicted/mispredicted.
• How to reduce misprediction frequency?
  – Increase the buffer size to avoid collisions.
    • Has little effect beyond ~4,096 entries.
  – Increase prediction accuracy
    • Increase # of bits/entry (little effect beyond 2)
    • Use a different prediction scheme (correlated predictors, tournament predictors)
Exhaustive Search for the Optimal 2-bit Predictor
• There are 2^20 possible state machines for 2-bit predictors
• Some machines are uninteresting; pruning them out reduces the number of state machines to 5248
• For each benchmark, determine the prediction accuracy of all the predictor state machines
• Optimal 2-bit predictor for each application (study by IBM):
  – spice2g6  97.2%
  – doduc     94.3%
  – gcc       89.1%
  – espresso  89.1%
  – li        87.1%
  – eqntott   87.9%
Correlated Prediction - Example
• Code fragment from eqntott:
    if (aa==2) aa=0;      /* if1 */
    if (bb==2) bb=0;      /* if2 */
    if (aa!=bb) { ... };  /* false if (if1 && if2) */
• MIPS code (aa=R1, bb=R2):

          SUBUI  R3,R1,#2    ; (aa-2)
          BNEZ   R3,L1       ; branch b1 (aa!=2)
          ADD    R1,R0,R0    ; aa=0
    L1:   SUBUI  R3,R2,#2    ; (bb-2)
          BNEZ   R3,L2       ; branch b2 (bb!=2)
          ADD    R2,R0,R0    ; bb=0
    L2:   SUBU   R3,R1,R2    ; (aa-bb)
          BEQZ   R3,L3       ; branch b3 (aa==bb)
          ...
Even Simpler Example
• C code:
    if (d==0) d=1;
    if (d==1) ...
• MIPS code (d=R1):

          BNEZ   R1,L1       ; b1: d!=0
          ADDI   R1,R0,#1    ; d=1
    L1:   SUBUI  R3,R1,#1    ; (d-1)
          BNEZ   R3,L2       ; b2: d!=1
          ...                ; (and others)

• Any correlation (b1, b2)?
Using a 1-bit Predictor
• Suppose the value of d alternates between 2 and 0, and the code repeats multiple times
• All branches are mispredicted!! (with initial NT, NT; see the trace below)
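• Worked trace (1-bit entries for b1 and b2, both initialized to Not Taken):

    d    b1 actual    b1 predicted   b2 actual    b2 predicted
    ----------------------------------------------------------
    2    Taken        NT (wrong)     Taken        NT (wrong)
    0    Not Taken    T  (wrong)     Not Taken    T  (wrong)
    2    Taken        NT (wrong)     Taken        NT (wrong)
    ...

  Each predictor flips after every execution, so it is always one
  step behind the alternating outcome.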
Correlating Predictors
• Keep different predictions for the current branch, depending on whether the previously executed branch instruction was taken or not.
• Notation: prediction-if-last-NOT-taken / prediction-if-last-TAKEN (two separate prediction entries; the prediction actually used is selected by the last branch outcome)
(m,n) Correlated Predictors
• Use the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch.
• Each of these predictors records n bits of history information for any given branch.
• On the previous slide we saw a (1,1) predictor.
• Easy to implement (see the sketch below):
  – Behavior of the last m branches: an m-bit shift register
  – Branch-prediction buffer: accessed with the low-order bits of the branch address, concatenated with the shift register.
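• A sketch of a (2,2) predictor's lookup path in C (sizes and names are illustrative; reuses bht_index() from earlier):

    unsigned ghr;                            /* 2-bit global history */
    unsigned char pht[BHT_SIZE << 2];        /* 2-bit counters       */

    int predict_2_2(unsigned pc) {
        unsigned idx = (bht_index(pc) << 2) | (ghr & 3);
        return pht[idx] >= 2;                /* 2-bit counter rule   */
    }

    void update_2_2(unsigned pc, int taken) {
        unsigned idx = (bht_index(pc) << 2) | (ghr & 3);
        if (taken) { if (pht[idx] < 3) pht[idx]++; }
        else       { if (pht[idx] > 0) pht[idx]--; }
        ghr = ((ghr << 1) | taken) & 3;      /* shift in the outcome */
    }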
(2,2) Correlated Predictor
(figure: a (2,2) predictor; the last 2 branch outcomes select one of 4 per-branch 2-bit predictors)
Correlated Predictors Better
(figure: misprediction rates, correlating vs. standard 2-bit predictors)
Tournament Predictors
• Three structures: a global correlated predictor, a local 2-bit predictor, and a tournament selector
• The selector determines which predictor (global or local) is used for each prediction; a sketch follows below
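• A sketch of the selector in C: a 2-bit counter per entry, trained toward whichever component predictor was right (indexing and the two component predictors are the illustrative ones sketched above):

    unsigned char chooser[BHT_SIZE];   /* 0,1 favor local; 2,3 favor global */

    int predict_tournament(unsigned pc) {
        int local  = predict_2bit(pc);
        int global = predict_2_2(pc);
        return (chooser[bht_index(pc)] >= 2) ? global : local;
    }

    /* call before updating the component predictors */
    void update_chooser(unsigned pc, int taken) {
        int local  = predict_2bit(pc);
        int global = predict_2_2(pc);
        unsigned char *c = &chooser[bht_index(pc)];
        if (global == taken && local != taken) { if (*c < 3) (*c)++; }
        if (local == taken && global != taken) { if (*c > 0) (*c)--; }
    }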
Performance of Tournament Predictors
(figure: tournament predictor prediction accuracy)
Branch-Target Buffers (BTB)
• How can we know the address of the next instruction as soon as the current instruction is fetched?
• Normally, an extra (ID) cycle is needed to:
  – Determine that the fetched instruction is a branch
  – Determine whether the branch is taken
  – Compute the target address PC+offset
• Branch prediction alone doesn't help: we need the next PC
• What if, instead, the next instruction address could be fetched at the same time that the current instruction is fetched? => BTB
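• A minimal C sketch of a BTB entry and its lookup, done in parallel with instruction fetch (fields and sizes are illustrative):

    #define BTB_SIZE 512

    struct btb_entry {
        unsigned tag;        /* full branch PC, to confirm a real hit */
        unsigned target;     /* predicted next PC                     */
        int      valid;
    };

    struct btb_entry btb[BTB_SIZE];

    /* returns the predicted next PC during IF */
    unsigned next_pc(unsigned pc) {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        if (e->valid && e->tag == pc)
            return e->target;   /* predicted-taken branch: fetch target */
        return pc + 4;          /* not in BTB: fall through              */
    }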
BTB Schematic
(figure: branch-target buffer organization)
Handling Instructions with a BTB
(flowchart: on a BTB hit, fetch from the predicted target; this flowchart is based on a 1-bit predictor)
Branch Penalties, Branch Folding
• If the instruction is not in the BTB and the branch is not taken (case not shown), the penalty is 0.
• Store target instructions, instead of their addresses, in the BTB
  – Saves on fetch time.
  – Permits branch folding: zero-cycle branches! Substitute the destination instruction for the branch in the pipeline!
Return Address Predictor
• Predicting register/indirect branches
  – E.g., indirect function calls, switch statements, procedure returns.
• Use a CPU-internal return-address stack (sketched below)
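• A C sketch of the return-address stack: push the return address on a predicted call, pop it to predict the target of a return (the depth is illustrative):

    #define RAS_DEPTH 16

    unsigned ras[RAS_DEPTH];
    int ras_top;                         /* wraps around when full */

    void ras_push(unsigned return_pc) {  /* on a predicted call    */
        ras[ras_top] = return_pc;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    unsigned ras_pop(void) {             /* on a predicted return  */
        ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
        return ras[ras_top];
    }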
Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch history table: 2 bits suffice for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take this insight to the next level by using multiple predictors
  – Usually one based on global information and one based on local information, combined with a selector
  – In 2006, tournament predictors using ~30K bits were in processors like the Power5 and Pentium 4
• Branch-target buffer: includes branch address & prediction