6.004 Computation Structures

MIT OpenCourseWare
http://ocw.mit.edu
6.004 Computation Structures
Spring 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
CPU Performance
Pipelining the Beta
We’ve got a working Beta… can we make it fast?
betta ('be-t&) n. Any of various species of small, brightly colored, long-finned freshwater fishes of the genus Betta, found in southeast Asia.
beta (‘bA-t&, ‘bE-) n. 1. The second letter of the Greek alphabet. 2. The exemplary computer system used in 6.004.
MIPS = Millions of Instructions/Second
MIPS = Freq / CPI
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
maybe they’ll give me partial credit...
I don’t think they mean the fish...
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? Tough... you’ll see multiple instruction issue
machines in 6.823.
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improved performance through fast
clocks.
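As a quick check of the MIPS formula, here is a minimal Python sketch. The clock frequencies and CPI values are made-up illustrations, not measurements of any Beta implementation.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """MIPS = Freq / CPI, with Freq in MHz and CPI in clocks per instruction."""
    return freq_mhz / cpi

# Hypothetical numbers, chosen only to illustrate the two levers.
baseline = mips(freq_mhz=100.0, cpi=1.0)       # slow clock, CPI of 1.0
faster_clock = mips(freq_mhz=400.0, cpi=1.0)   # clock 4x faster, CPI unchanged
with_penalty = mips(freq_mhz=400.0, cpi=1.25)  # same fast clock, but CPI has crept up
print(baseline, faster_clock, with_penalty)
```

A faster clock only pays off in full if CPI stays near 1.0, which is why the hazards discussed later in the lecture matter.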
Lab #7 due Tonight!
4/30/09
6.004 – Spring 2009
LDR(X,R3)
LD(R1,10,R0)
L22 – Pipelining the Beta 1
modified 4/27/09 11:17
Beta Timing
[Figure: “precedence graph” of one Beta clock cycle, from CLK edge to CLK edge. Nodes: New PC, PC+4, Fetch Inst., Control Logic, +OFFSET, RA2SEL mux, Read Regs, =0?, ASEL mux, BSEL mux, ALU, Fetch data, PCSEL mux, WDSEL mux, PC setup, RF setup, Mem setup.]
Wanted: longest paths
Complications:
• operations have variable execution times (e.g., ALU)
• some apparent paths aren’t “possible”
• time axis is not to scale (e.g., tPD,MEM is very big!)

Why isn’t this a 20-minute lecture?
We’ve learned how to pipeline combinational circuits. What’s the big deal?
1. The Beta isn’t combinational…
Explicit state in register file, memory;
Hidden state in PC.
2. Consecutive operations – instruction executions – interact:
• Jumps, branches dynamically change instruction sequence
• Communication through registers, memory
Our goals:
• Move slow components into separate pipeline stages, running
clock faster
• Maintain instruction semantics of unpipelined Beta as far as
possible
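To see why moving slow components into separate stages lets the clock run faster, here is a rough Python sketch. The component delays, the register overhead, and the assumption that the unpipelined critical path passes through every component in series are all hypothetical, chosen for illustration only.

```python
def clock_period(stage_delays_ns, register_overhead_ns=0.0):
    """The clock period must cover the slowest stage plus register overhead."""
    return max(stage_delays_ns) + register_overhead_ns

# Hypothetical component delays in ns (assumptions, not Beta measurements).
delays = {
    "instruction fetch": 4.0,
    "register read": 2.0,
    "ALU": 3.0,
    "data memory": 4.0,
    "write back": 1.0,
}

# Unpipelined: one long combinational path through every component.
unpipelined = clock_period([sum(delays.values())])
# Pipelined: one component per stage, with 0.5 ns of assumed register overhead.
pipelined = clock_period(list(delays.values()), register_overhead_ns=0.5)
print(unpipelined, pipelined, round(unpipelined / pipelined, 2))
```

The speedup is bounded by the slowest stage, not by the number of stages.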
Ultimate Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
IF   Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
RF   Register File stage: Reads source operands from register file, passes them to
ALU  ALU stage: Performs indicated operation, passes result to
MEM  Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to
WB   Write-Back stage: writes result back into register file.

First Steps: A Simple 2-Stage Pipeline
[Figure: Beta datapath split into IF and EXE stages. IF: PCSEL mux (inputs ILL OP, XAdr, JT, branch target, PC+4), PC_IF, +4, Instruction Memory. Pipeline registers: PC_EXE, IR_EXE. EXE: Register File (RA1, RA2, WA, WD, RD1, RD2, WE/WERF), RA2SEL, WASEL, XP, Z, ASEL, BSEL, ALU (ALUFN), Data Memory (WD, R/W, Adr, RD), WDSEL.]

2-Stage Pipelined Beta Operation
Consider a sequence of instructions:
...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...
Executed on our 2-stage pipeline:
TIME (cycles)
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADDC  SUBC  XOR   MUL   ...
EXE              ADDC  SUBC  XOR   MUL   ...

Pipeline Control Hazards
BUT consider instead:
LOOP: ADD(r1, r3, r3)
CMPLEC(r3, 100, r0)
BT(r0, LOOP)
XOR(r3, -1, r3)
MUL(r1, r2, r2)
...
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   BT    XOR   ...
EXE              ADD   CMP   BT    ?
This is the cycle where the branch decision is made… but we’ve already fetched the following instruction which should be executed only if branch is not taken!
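The timing table for the hazard-free ADDC/SUBC/XOR/MUL sequence follows a simple rule: instruction k occupies stage s during cycle k + s. A minimal Python sketch of that rule is below. The function name `pipeline_table` is made up for this illustration, and the model deliberately ignores branches and stalls.

```python
def pipeline_table(instrs, stages):
    """Return {stage: [instruction name or '' per cycle]} for an ideal pipeline
    with no stalls: instruction k is in stage s during cycle k + s."""
    n_cycles = len(instrs) + len(stages) - 1
    table = {}
    for s, stage in enumerate(stages):
        row = [""] * n_cycles
        for k, name in enumerate(instrs):
            row[k + s] = name
        table[stage] = row
    return table

table = pipeline_table(["ADDC", "SUBC", "XOR", "MUL"], ["IF", "EXE"])
for stage, row in table.items():
    print(f"{stage:5}" + "".join(f"{name:6}" for name in row))
```

Passing `["IF", "RF", "ALU", "WB"]` as the stage list gives the corresponding ideal table for a 4-stage pipeline.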
Branch Delay Slots
PROBLEM: One (or more) following instructions have been prefetched by the time a branch is taken.
POSSIBLE SOLUTIONS:
1. Make hardware “annul” instructions following branches which are taken, e.g., by disabling WERF and WR.
2. “Program around it”. Either
a) Follow each BR with a NOP instruction; or
b) Make compiler clever enough to move USEFUL instructions into the branch delay slots
i. Always execute instructions in delay slots
ii. Conditionally execute instructions in delay slots
NOP = ‘no-operation’, e.g. ADD(R31, R31, R31)

Branch Alternative 1
Make the hardware annul instructions in the branch delay slots of a taken branch.
LOOP: ADD(r1, r3, r3)
CMPLEC(r3, 100, r0)
BT(r0, LOOP)
XOR(r3, -1, r3)
MUL(r1, r2, r2)
...
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   BT    XOR   ADD   CMP   BT
EXE              ADD   CMP   BT    NOP   ADD   CMP
Branch taken: the XOR fetched in cycle i+3 is annulled and executes as a NOP in cycle i+4.
Pros: same program runs on both unpipelined and pipelined hardware
Cons: in SPEC benchmarks 14% of instructions are taken branches, so about 12% of total cycles are annulled

Branch Annulment Hardware
[Figure: the 2-stage pipelined Beta datapath with a NOP mux, controlled by ANNUL_IF, between the Instruction Memory output and IR_EXE. Other labels as in the 2-stage pipeline figure: PCSEL, PC_IF, PC_EXE, Register File, ALU, Data Memory, WDSEL.]

Branch Alternative 2a
Fill branch delay slots with NOP instructions (i.e., the software equivalent of alternative 1)
LOOP: ADD(r1, r3, r3)
CMPLEC(r3, 100, r0)
BT(r0, LOOP)
NOP()
XOR(r3, -1, r3)
MUL(r1, r2, r2)
...
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   BT    NOP   ADD   CMP   BT
EXE              ADD   CMP   BT    NOP   ADD   CMP
Branch taken
Pros: same as alternative 1
Cons: NOPs make code longer; 12% of cycles spent executing NOPs
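One plausible reading of how the 12% figure follows from the 14% figure: if every instruction takes one cycle and each taken branch wastes one extra cycle in its delay slot, the wasted share of all cycles is 0.14 / 1.14. The sketch below encodes that assumption. It is a reconstruction of the arithmetic, not something stated on the slides.

```python
def annulled_fraction(taken_branch_fraction, delay_slots=1):
    """Fraction of all cycles that are wasted, assuming one cycle per
    instruction plus `delay_slots` wasted cycles per taken branch."""
    wasted = taken_branch_fraction * delay_slots
    return wasted / (1.0 + wasted)

frac = annulled_fraction(0.14)
print(f"{frac:.1%}")  # prints 12.3%
```

The same function with `delay_slots=2` gives a feel for the cost of making the branch decision one stage later.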
Branch Alternative 2b(i)
Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not
LOOP: ADD(r1,r3,r3)
LOOPx: CMPLEC(r3,100,r0)
BT(r0,LOOPx)
ADD(r1,r3,r3)
SUB(r3,r1,r3)
XOR(r3,-1,r3)
MUL(r1,r2,r2)
...
We need to add this silly instruction (the SUB) to UNDO the effects of that last ADD
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   BT    ADD   CMP   BT    ADD
EXE              ADD   CMP   BT    ADD   CMP   BT
Branch taken
Pros: only two “extra” instructions are executed (on last iteration)
Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.

Branch Alternative 2b(ii)
Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted
LOOP: ADD(r1, r3, r3)
LOOPx: CMPLEC(r3, 100, r0)
BT.taken(r0, LOOPx)
ADD(r1, r3, r3)
XOR(r3, -1, r3)
MUL(r1, r2, r2)
...
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   BT    ADD   CMP   BT    ADD
EXE              ADD   CMP   BT    ADD   CMP   BT
Branch taken
Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions
Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.
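To check that the alternative 2b(i) rewrite computes the same result as the original loop, here is a small Python sketch of an interpreter with an optional single branch delay slot. The tuple encoding of instructions, the `run` function, and the initial register values are assumptions made for this illustration. XOR is modeled as accepting a literal operand, as the slides write it.

```python
def run(program, regs, delay_slot):
    """Interpret a tiny Beta-like program given as (label, op, a, b, c) tuples.
    With delay_slot=True, the instruction after a taken branch still executes."""
    regs = dict(regs)
    labels = {lab: i for i, (lab, *_rest) in enumerate(program) if lab}

    def val(x):  # operand is a register name (str) or an integer literal
        return regs[x] if isinstance(x, str) else x

    pc, pending, executed = 0, None, 0
    while pc < len(program):
        _, op, a, b, c = program[pc]
        next_pc = pc + 1
        if pending is not None:  # this instruction sits in a delay slot
            next_pc, pending = pending, None
        if op == "ADD":
            regs[c] = val(a) + val(b)
        elif op == "SUB":
            regs[c] = val(a) - val(b)
        elif op == "MUL":
            regs[c] = val(a) * val(b)
        elif op == "XOR":
            regs[c] = val(a) ^ val(b)
        elif op == "CMPLEC":
            regs[c] = int(val(a) <= b)
        elif op == "BT" and val(a) != 0:
            if delay_slot:
                pending = labels[b]   # transfer after the next instruction
            else:
                next_pc = labels[b]   # transfer immediately
        executed += 1
        pc = next_pc
    return regs, executed

tail = [(None, "XOR", "r3", -1, "r3"), (None, "MUL", "r1", "r2", "r2")]
original = [("LOOP", "ADD", "r1", "r3", "r3"),
            (None, "CMPLEC", "r3", 100, "r0"),
            (None, "BT", "r0", "LOOP", None)] + tail
rewritten = [("LOOP", "ADD", "r1", "r3", "r3"),
             ("LOOPx", "CMPLEC", "r3", 100, "r0"),
             (None, "BT", "r0", "LOOPx", None),
             (None, "ADD", "r1", "r3", "r3"),   # delay slot, always executed
             (None, "SUB", "r3", "r1", "r3")] + tail  # undoes the last ADD

init = {"r0": 0, "r1": 30, "r2": 2, "r3": 0}  # hypothetical starting values
seq_regs, seq_count = run(original, init, delay_slot=False)
ds_regs, ds_count = run(rewritten, init, delay_slot=True)
print(seq_regs == ds_regs, ds_count - seq_count)
```

For these inputs both versions end in the same register state, and the delay-slot version executes two more instructions, matching the "two extra instructions on the last iteration" remark.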
Architectural Issue: Branch Decision Timing
[Figure: IF, RF, and ALU pipeline stages. PCSEL mux (inputs ILL OP, XAdr, JT, <PC>+C, PC+4), PC_IF, +4, Instruction Memory, Register File with a Z output tested in the RF stage, ASEL and BSEL muxes, ALU.]
BETA approach:
• SIMPLE branch condition logic ... Test for Reg[Ra] = 0!
• ADVANTAGE: early decision, single delay slot
ALTERNATIVES:
• Compare-and-branch... (e.g., if Reg[Ra] > Reg[Rb])
• MORE powerful, but
• LATER decision (hence more delay slots)
Suppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!)
Wow! I guess those guys really were thinking when they made up all those instructions

4-Stage Beta Pipeline
[Figure: pipeline stages Instruction Fetch, Register File (read), ALU, Write Back. Each stage has its own PC and IR (instruction) registers with control logic CL; the ALU result Y is passed to MEM and WB. Instruction fields: Ra <20:16>, Rb <15:11>, Rc <25:21>, C <15:0> sign-extended (shifted left by 2 for the branch offset). Data Memory (Adr, WD, R/W, RD). Write Back: WDSEL mux, WASEL mux (XP or Rc), Register File write port (WA, WD, WE/WERF). NB: SAME RF AS ABOVE!]
Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.
What other information do we have to pass down pipeline?
PC (return addresses)
instruction fields (decoding)
What sort of improvement should we expect in cycle time?
4-Stage Beta Operation
Consider a sequence of instructions:
...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...
Executed on our 4-stage pipeline:
TIME (cycles)
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADDC  SUBC  XOR   MUL   ...
RF               ADDC  SUBC  XOR   MUL   ...
ALU                    ADDC  SUBC  XOR   MUL   ...
WB                           ADDC  SUBC  XOR   MUL

Pipeline “Data Hazard”
BUT consider instead:
ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   MUL   SUB
RF               ADD   CMP   MUL   SUB
ALU                    ADD   CMP   MUL   SUB
WB                           ADD   CMP   MUL   SUB
Oops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3!

Data Hazard Solution 1
Stall the pipeline: Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   MUL   MUL   MUL   SUB
RF               ADD   CMP   CMP   CMP   MUL   SUB
ALU                    ADD   NOP   NOP   CMP   MUL
WB                           ADD   NOP   NOP   CMP
Drawback: NOPs mean “wasted” cycles

Data Hazard Solution 2
“Program around it” ... document weirdo semantics, declare it a software problem.
- Breaks sequential semantics!
- Costs code efficiency.
EXAMPLE: Rewrite
ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
as
ADD(r1, r2, r3)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
CMPLEC(r3, 100, r0)
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   MUL   SUB   CMP
RF               ADD   MUL   SUB   CMP
ALU                    ADD   MUL   SUB   CMP
WB                           ADD   MUL   SUB   CMP
r3 available: ADD in WB (cycle i+3). r3 fetched: CMP in RF (cycle i+4).
How often can we do this?
Programmer’s fallback: Insert NOPs (sigh!)
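The stall behavior of Solution 1 and the benefit of the reordering in Solution 2 can be checked with a small model. The Python sketch below assumes a 4-stage IF/RF/ALU/WB pipeline with no bypass paths, where a register written in WB can first be read in the RF stage one cycle later. The tuple encoding and function names are made up for this illustration, and R31 is not treated specially.

```python
def rf_read_cycles(instrs):
    """Return, for each (name, dest, sources) instruction, the cycle in which
    it finally reads its operands in the RF stage (IF of the first is cycle 0)."""
    write_done = {}        # register -> cycle at whose end the write completes
    cycles, prev = [], 0
    for _name, dest, sources in instrs:
        cycle = prev + 1   # in order: at best one cycle behind the predecessor
        for reg in sources:
            if reg in write_done:
                # Stall in RF until the producer has finished WB.
                cycle = max(cycle, write_done[reg] + 1)
        write_done[dest] = cycle + 2   # ALU in the next cycle, WB the one after
        cycles.append(cycle)
        prev = cycle
    return cycles

def stall_cycles(instrs):
    """Extra cycles compared with an ideal, hazard-free schedule."""
    return rf_read_cycles(instrs)[-1] - len(instrs)

hazard = [("ADD", "r3", ["r1", "r2"]),
          ("CMPLEC", "r0", ["r3"]),
          ("MULC", "r4", ["r1"]),
          ("SUB", "r5", ["r1", "r2"])]
reordered = [hazard[0], hazard[2], hazard[3], hazard[1]]

print(rf_read_cycles(hazard), stall_cycles(hazard))
print(rf_read_cycles(reordered), stall_cycles(reordered))
```

With the original order, CMPLEC reads in cycle i+4 after two stall cycles. With the reordered code it also reads in cycle i+4, but without any stalls.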
Data Hazard Solution 3
Bypass (aka forwarding) Paths:
Add extra data paths & control logic to re-route data in problem cases.
Pipeline   i     i+1   i+2   i+3   i+4   i+5   i+6
IF         ADD   CMP   MUL   SUB
RF               ADD   CMP   MUL   SUB
ALU                    ADD   CMP   MUL   SUB
WB                           ADD   CMP   MUL   SUB
r3 available: at the ALU output during cycle i+2.
Idea: the result from the ADD which will be written into the register file at the end of cycle i+3 is actually available at output of ALU during cycle i+2, just in time for it to be used by CMP in the RF stage!

Bypass Paths (I)
[Figure: RF, ALU, and WB stages. IR_RF holds CMPLEC(r3,100,r0) and IR_ALU holds ADD(r1,r2,r3). Bypass muxes on the Register File outputs RD1 and RD2 can select the ALU output Y in place of the register file value.]
SELECT this BYPASS path if
OpCode_RF = reads Ra
and OpCode_ALU = OP, OPC (i.e., instructions which use ALU to compute result)
and Ra_RF = Rc_ALU
and Ra_RF != R31
Bypass Paths (II)
[Figure: RF, ALU, and WB stages. IR_RF holds ADD(r1,r2,r3), IR_ALU holds MULC(r4,17,r5), and IR_WB holds XOR(r2,r6,r1). The bypass muxes can also select the WB-stage value that is on its way to the Register File write port (WA, WD, WE).]
But why can’t we get it from the register file? It’s being written this cycle!
SELECT this BYPASS path if
OpCode_RF = reads Ra
and Ra_RF != R31
and not using ALU bypass
and WERF = 1
and Ra_RF = WA

Next Time
More Beta Bypasses Ahead
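The two SELECT conditions from the Bypass Paths slides can be read as a priority choice for the A operand. The Python sketch below is one possible rendering. The `Instr` record, the string opcode classes, and the returned source names are assumptions made for illustration, and only the Ra operand is modeled.

```python
from dataclasses import dataclass
from typing import Optional

R31 = 31  # R31 always reads as zero on the Beta, so it never needs a bypass

@dataclass
class Instr:
    opcode_class: str          # "OP", "OPC", "LD", ... (hypothetical encoding)
    ra: Optional[int] = None   # source register Ra, or None if Ra is not read
    rc: int = R31              # destination register Rc

def select_ra_source(rf: Instr, alu: Instr, wb_wa: int, werf: bool) -> str:
    """Pick where the RF-stage instruction gets its A operand from."""
    if rf.ra is None or rf.ra == R31:
        return "regfile"
    # Bypass (I): the ALU-stage instruction computes the value with the ALU.
    if alu.opcode_class in ("OP", "OPC") and rf.ra == alu.rc:
        return "alu_bypass"
    # Bypass (II): the value is being written to the register file this cycle.
    if werf and rf.ra == wb_wa:
        return "wb_bypass"
    return "regfile"

# Slide (I): CMPLEC(r3,100,r0) in RF, ADD(r1,r2,r3) in ALU.
case1 = select_ra_source(Instr("OPC", ra=3, rc=0), Instr("OP", ra=1, rc=3),
                         wb_wa=R31, werf=False)
# Slide (II): ADD(r1,r2,r3) in RF, MULC(r4,17,r5) in ALU, XOR(r2,r6,r1) in WB.
case2 = select_ra_source(Instr("OP", ra=1, rc=3), Instr("OPC", ra=4, rc=5),
                         wb_wa=1, werf=True)
print(case1, case2)
```

Checking the ALU bypass first reflects the "not using ALU bypass" clause: when both stages hold a write to the same register, the ALU-stage result is the more recent one.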