L16-BranchPrediction-1 - Computation Structures Group

Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
April 9, 2012
http://csg.csail.mit.edu/6.S078
L16-1
Control Flow Penalty
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width
[Figure: pipeline from PC through I-cache, Fetch Buffer, Issue Buffer, Func. Units and Result Buffer to Arch. State, divided into Fetch, Decode, Execute and Commit stages. "Next fetch started" is marked at the PC and "Branch executed" at the Func. Units.]
Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:
            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add                   20 %
FPU Mult                  13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run-length between branches?
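One way to approach the question, as a rough sketch: if a fraction f of the dynamic instructions are branches, then on average one instruction in every 1/f is a branch. The name `avg_run_length` is ours, and the run length is assumed to count the branch itself.

```python
# Branch fractions taken from the SPEC92 instruction-mix table.
BRANCH_FRACTION = {"SPECint92": 0.16, "SPECfp92": 0.08}

def avg_run_length(branch_fraction):
    """Average number of dynamic instructions per branch (branch included)."""
    return 1.0 / branch_fraction

for suite, frac in BRANCH_FRACTION.items():
    print(f"{suite}: about {avg_run_length(frac):.2f} instructions per branch")
```

Under this assumption, integer code meets a branch roughly twice as often as floating-point code.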
MIPS Branches and Jumps
Each instruction fetch depends on one or two
pieces of information from the preceding
instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?
Instruction   Taken known?         Target known?
J             After Inst. Decode   After Inst. Decode
JR            After Inst. Decode   After Reg. Fetch
BEQZ/BNEZ     After Exec           After Inst. Decode
Currently our simple pipelined
architecture does very simple
branch prediction
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, …
Can we do better?
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Use saturating counter
BP bits   State
1 1       Strongly taken
1 0       Weakly taken
0 1       Weakly ¬taken
0 0       Strongly ¬taken
On taken: move up one state (saturating at 11)
On ¬taken: move down one state (saturating at 00)
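The counter behavior can be sketched in a few lines. This is an illustrative model, not the lecture's code: the function names are ours, and the four states are encoded as the integers 0 (strongly ¬taken) through 3 (strongly taken).

```python
def update_counter(counter, taken):
    """Next state of a 2-bit saturating counter with states 0..3."""
    if taken:
        return min(counter + 1, 3)   # saturate at strongly taken
    return max(counter - 1, 0)       # saturate at strongly not-taken

def predict_taken(counter):
    """States 2 (weakly taken) and 3 (strongly taken) predict taken."""
    return counter >= 2

# A loop branch: taken four times, not taken once at loop exit, taken again.
counter = 3
history = []
for outcome in [True, True, True, True, False, True]:
    history.append(predict_taken(counter))
    counter = update_counter(counter, outcome)
```

In this run the single not-taken outcome only moves the counter to weakly taken, so the branch is still predicted taken when the loop is re-entered.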
Branch History Table (BHT)
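[Figure: k bits of the Fetch PC (above the two low-order 00 bits) form the BHT Index into a 2^k-entry BHT with 2 bits/entry, accessed alongside the I-Cache. The fetched instruction's opcode answers "Branch?", PC + offset gives the Target PC, and the BHT entry gives Taken/¬Taken?]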
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
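A minimal software model of such a table, assuming the index is the k PC bits just above the two low-order zero bits and that every counter starts at weakly ¬taken (both are our assumptions; the class and method names are hypothetical):

```python
class BHT:
    """Toy branch history table: 2**k two-bit saturating counters."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.counters = [1] * (1 << k)   # assumption: start weakly not-taken

    def index(self, pc):
        return (pc >> 2) & self.mask     # drop the two low-order 00 bits

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)

bht = BHT(k=12)                          # 4K entries
first_guess = bht.predict(0x40)          # untrained entry
bht.update(0x40, True)
second_guess = bht.predict(0x40)         # after one taken outcome
```

Because no tag is stored, two branches whose PCs differ by a multiple of 4 x 2^k bytes share one counter in this model.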
Where does BHT fit in the
processor pipeline?
BHT can only be used after instruction decode
What should we do at the fetch stage?
Need a mechanism to update the BHT
  => Where does the update information come from?
Overview of branch prediction
[Figure: pipeline stages PC, Decode, Reg Read, Execute, with a Next Addr Pred at the PC and BP, JMP, Ret predictors later in the pipeline.]
• PC: need next PC immediately (tight loop)
• Decode: instr type and PC relative targets available (loose loop)
• Reg Read: simple conditions and register targets available (loose loop)
• Execute: complex conditions available (loose loop)
Best predictors reflect program behavior
Next Address Predictor (NAP)
first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with iMem; each entry holds a predicted target and BP bits (BPb).]
BP bits are stored with the predicted target address.
IF stage: nPC = if (BP=taken) then target else pc+4
Later: check the prediction; if wrong, kill the instruction and update BTB & BPb, else update BPb
Address Collisions
Assume a 128-entry NAP
Instruction Memory:
   132  Jump 100
  1028  Add .....
NAP entry: target = 236, BPb = take
What will be fetched after the instruction at 1028?
  NAP prediction = 236
  Correct target = 1032
  => kill PC=236 and fetch PC=1032
Is this a common occurrence?
Can we avoid these bubbles?
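The collision can be reproduced with a small tagless model. The indexing function is an assumption on our part: the sketch uses the low-order bits of the byte address (pc mod 128), which is one choice under which 132 and 1028 land in the same entry. All names are hypothetical.

```python
NUM_ENTRIES = 128

def nap_index(pc):
    # Assumption: index with the low-order bits of the byte address.
    return pc % NUM_ENTRIES

# Each entry holds (predicted target, BP bits say taken). No tag is stored.
nap = [(0, False)] * NUM_ENTRIES

def nap_predict(pc):
    target, taken = nap[nap_index(pc)]
    return target if taken else pc + 4

def nap_update(pc, target, taken):
    nap[nap_index(pc)] = (target, taken)

nap_update(132, 236, True)                # "132 Jump 100" trains the entry
print(nap_index(132), nap_index(1028))    # both map to the same entry
print(nap_predict(1028))                  # the jump's target, not 1028 + 4
```

The Add at 1028 inherits the jump's prediction, so the fetch of 236 has to be killed and 1032 fetched instead.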
Use NAP for Control Instructions only
NAP contains useful information for branch and
jump instructions only
  => Do not update it for other instructions
For all other instructions the next PC is (PC)+4 !
How to achieve this effect without decoding the
instruction?
Branch Target Buffer (BTB)
a special form of NAP
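[Figure: a 2^k-entry direct-mapped BTB accessed in parallel with the I-Cache. k bits of the PC select an entry holding an Entry PC, a Valid bit and a predicted target PC. The Entry PC is compared (=) with the PC to produce match, which together with valid selects the target.]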
• Keep the (pc, predicted pc) in the BTB
• pc+4 is predicted if no pc match is found
• BTB is updated only for branches and jumps
Permits nextPC to be determined before instruction is decoded
Consulting BTB Before
Decoding
Instruction Memory:
   132  Jump 100
  1028  Add .....
BTB entry: entry PC = 132, target = 236, BPb = take
• The match for pc = 1028 fails and 1028+4 is fetched
  => eliminates false predictions after ALU instructions
• BTB contains entries only for control transfer instructions
  => more room to store branch targets
Even very small BTBs are very effective
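A minimal sketch of the tagged lookup, assuming a 128-entry direct-mapped table indexed by the low-order bits of the byte address (an assumption chosen so that 132 and 1028 share a slot). The class and its method names are hypothetical.

```python
class BTB:
    """Toy direct-mapped BTB; each slot holds (valid, entry_pc, target)."""

    def __init__(self, num_entries=128):
        self.num_entries = num_entries
        self.slots = [(False, 0, 0)] * num_entries

    def _index(self, pc):
        # Assumption: low-order bits of the byte address select the slot.
        return pc % self.num_entries

    def predict(self, pc):
        valid, entry_pc, target = self.slots[self._index(pc)]
        if valid and entry_pc == pc:     # full-PC match required
            return target
        return pc + 4                    # no match: predict fall-through

    def update(self, pc, target):
        # Called only for branches and jumps.
        self.slots[self._index(pc)] = (True, pc, target)

btb = BTB()
btb.update(132, 236)          # "132 Jump 100"
print(btb.predict(132))       # hit: predicted target
print(btb.predict(1028))      # same slot, but the entry PC differs
```

The Add at 1028 maps to the slot trained by the jump, yet the entry-PC comparison fails, so the model falls back to 1028+4.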
Observations
There is a plethora of branch prediction
schemes – their importance grows with the
depth of processor pipeline
Processors often use more than one prediction
scheme
It is usually easy to understand the data
structures required to implement a particular
scheme
It takes considerably more effort to understand
how a particular scheme with its lookup and
updates is integrated in the pipeline and how
various schemes interact with each other
Plan
We will begin with a very simple 2-stage pipeline
and integrate a simple BTB scheme in it
We will extend the design to a multistage
pipeline and integrate at least one more
predictor, say BHT, in the pipeline (next lecture)
First, revisit the simple two-stage pipeline without
branch prediction
Decoupled Fetch and Execute
[Figure: Fetch sends <instructions, pc, epoch> to Execute through ir; Execute sends <updated pc> back to Fetch through nextPC.]
Fetch sends instructions to Execute along with
pc and other control information
Execute sends information about the target pc
to Fetch, which updates pc and other control
registers whenever it looks at the nextPC fifo
A solution using epoch
Add fEpoch and eEpoch registers to the
processor state; initialize them to the same
value
The epoch changes whenever Execute
determines that the pc prediction is wrong.
This change is reflected immediately in eEpoch
and eventually in fEpoch via nextPC FIFO
Associate the fEpoch with every instruction
when it is fetched
In the execute stage, reject, i.e., kill, the
instruction if its epoch does not match eEpoch
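The epoch scheme can be illustrated with a small sequential model. This is a sketch, not a faithful rendering of hardware rule semantics: each loop iteration runs an execute step and then a fetch step, fetch always guesses pc+4, and the toy `PROGRAM`, the `run` function and its variable names are all hypothetical.

```python
from collections import deque

# Toy program: pc -> ("jump", target) or ("op", None).
PROGRAM = {0: ("op", None), 4: ("jump", 16), 8: ("op", None),
           12: ("op", None), 16: ("op", None), 20: ("op", None)}

def run(cycles):
    pc, f_epoch, e_epoch = 0, False, False
    ir = deque()         # Fetch -> Execute: (pc, epoch at fetch time)
    next_pc = deque()    # Execute -> Fetch: corrected pc values
    executed, killed = [], []
    for _ in range(cycles):
        # Execute: take the oldest fetched instruction, if any.
        if ir:
            ipc, epoch = ir.popleft()
            if epoch == e_epoch:
                executed.append(ipc)
                kind, target = PROGRAM[ipc]
                if kind == "jump":            # the pc+4 guess was wrong
                    next_pc.append(target)
                    e_epoch = not e_epoch     # eEpoch changes immediately
            else:
                killed.append(ipc)            # wrong-path instruction
        # Fetch: tag the instruction with fEpoch, then pick the next pc.
        if pc in PROGRAM:
            ir.append((pc, f_epoch))
        if next_pc:
            pc = next_pc.popleft()
            f_epoch = not f_epoch             # fEpoch catches up via the FIFO
        else:
            pc += 4
    return executed, killed

executed, killed = run(6)
print(executed, killed)
```

In this run the instruction at 8 is fetched with the old epoch while the jump at 4 is executing, so it is rejected when it reaches the execute step.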
Two-Stage pipeline
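A robust two-rule solution
[Figure: two-stage datapath. Fetch side: PC, +4, Inst Memory, fEpoch. Decode/Execute side: Register File, Data Memory, eEpoch. ir is the Pipeline FIFO from Fetch to Execute; nextPC is the Bypass FIFO from Execute back to Fetch.]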
Either fifo can be a normal (>1 element) fifo
Two-stage pipeline
Decoupled
module mkProc(Proc);
  Reg#(Addr)   pc     <- mkRegU;
  RFile        rf     <- mkRFile;
  IMemory      iMem   <- mkIMemory;
  DMemory      dMem   <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)   fEpoch <- mkReg(False);
  Reg#(Bool)   eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull);                 // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4;                       // simple branch prediction
  endrule
Two-stage pipeline
Decoupled cont
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
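Two-Stage pipeline with a Branch Predictor
[Figure: the two-stage datapath with a Branch Predictor attached to the PC in Fetch. The predicted pc (ppc) travels with the instruction through ir to Decode/Execute. Also shown: +, Inst Memory, Register File, Data Memory, fEpoch, eEpoch and nextPC.]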
Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface
Null Branch Prediction
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule
Replaces PC+4 with …
  => Already implemented in the pipeline
Right most of the time
  => Why?
Branch Target Prediction (BTB)
module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr    <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;

  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag = tagArr.sub(index);
    let target = targetArr.sub(index);
    if (tag==pc) return target; else return (pc+4);
  endmethod

  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule
Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr)   pc     <- mkRegU;
  RFile        rf     <- mkRFile;
  IMemory      iMem   <- mkIMemory;
  DMemory      dMem   <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)   fEpoch <- mkReg(False);
  Reg#(Bool)   eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;   // some target predictor

The definition of TypeFetch2Decode is changed to include the predicted pc:

typedef struct {
  Addr pc; Addr ppc; Bool epoch; Data inst;
} TypeFetch2Decode deriving (Bits, Eq);
Two-stage pipeline + BP
Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc; fEpoch <= !fEpoch; nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule
Two-stage pipeline + BP
Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1,
                       Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes  = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr  = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype   = dInst.iType;
  einst.addr    = memType(dInst.iType) ? aluRes : brAddr;
  einst.data    = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  // mispredicted if ppc differs from the actual next pc
  einst.missPrediction = einst.brTaken ? brAddr!=ppc : (pc+4)!=ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
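The misprediction test in the execute function can be checked in isolation with a small sketch. The function below is a hypothetical Python restatement of that one expression, with plain integers standing in for Addr values.

```python
def mispredicted(pc, ppc, br_taken, br_addr):
    """True if the predicted pc (ppc) differs from the actual next pc."""
    actual_next = br_addr if br_taken else pc + 4
    return actual_next != ppc

# Taken jump at 132 to 236, predicted as fall-through (never-taken predictor).
print(mispredicted(pc=132, ppc=136, br_taken=True, br_addr=236))
# The same jump once a BTB supplies the target.
print(mispredicted(pc=132, ppc=236, br_taken=True, br_addr=236))
# A non-branch at 1028 predicted as pc+4.
print(mispredicted(pc=1028, ppc=1032, br_taken=False, br_addr=0))
```

Only the first case triggers a redirect. Note that a non-branch wrongly predicted as taken is also caught, because the comparison is against pc+4.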