6.004 Computation Structures

MIT OpenCourseWare
http://ocw.mit.edu
6.004 Computation Structures
Spring 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Pipeline Issues
“This pipeline stuff makes my head hurt!”
“Maybe it’s that dumb hat”
Home Stretch:
Lab #8 due Thursday 5/7;
Quiz 5 FRIDAY 5/8!
5/5/09
6.004 – Spring 2009
modified 5/4/09 10:06
L23 – Pipeline Issues 1
Recalling Data Hazards
PROBLEM: Subsequent instructions can reference the contents of a register well before the pipeline stage where the register is written.
Pipeline diagram (4-stage pipeline: IF, RF, ALU, WB):
       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   CMP   MUL   SUB
RF           ADD   CMP   MUL   SUB
ALU                ADD   CMP   MUL   SUB
WB                       ADD   CMP   MUL   SUB
ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r3, 100, r4)
SUB(r0, r4, r5)
SOLUTION #1:
Deal with it in SOFTWARE; expose the pipeline for all to see.
SOLUTION #2:
Add special hardware to maintain the sequential execution semantics of the ISA.
Bypass Paths
Add special data paths, called BYPASSES, that route the results of the ALU and WB stages to the RF stage, thus substituting the register’s old contents with a value that will be written to that register at some point in the future.
Detection of these cases has to be incorporated into the decoding logic of the RF stage, which basically looks at the instructions in the ALU and WB stages to see if their destination register matches a source register reference.
[Figure: bypass muxes on the A and B outputs of the register file, shown with SUB(r0,r4,r5) in the RF stage, MULC(r3,100,r4) in the ALU stage, and CMPLEC(r3,100,r0) in the WB stage.]
Load Hazards
Consider LOADS: Can we fix all these problems using bypass paths?
LD(r1, 0, r4)
ADD(r4, r1, r5)
XOR(r3, r4, r6)
       i     i+1   i+2   i+3   i+4   i+5
IF     LD    ADD   XOR
RF           LD    ADD   XOR
ALU                LD    ADD   XOR
WB                       LD    ADD   XOR
The hazard between the XOR and the LD can be addressed by our established bypass paths…
But there are some problems that BYPASSING CAN’T FIX!
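Before turning to those cases, the RF-stage detection described under Bypass Paths can be sketched in Python. This is only an illustration for the 4-stage pipeline: the `Instr` record and the `bypass_select` function are hypothetical names, the rule that the younger instruction (the one in ALU) wins over the one in WB is an assumption, and the sketch ignores R31 and loads.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction (not part of the slides)."""
    op: str
    srcs: Tuple[int, ...]   # register numbers read in the RF stage
    dest: Optional[int]     # register number written in WB, or None


def bypass_select(src: int, in_alu: Optional[Instr], in_wb: Optional[Instr]) -> str:
    """Say where the RF stage should take operand `src` from.

    Assumption: the instruction in ALU is younger than the one in WB,
    so its result is preferred when both write the same register.
    """
    if in_alu is not None and in_alu.dest == src:
        return "ALU bypass"
    if in_wb is not None and in_wb.dest == src:
        return "WB bypass"
    return "regfile"


# The four-instruction example from the slides.
ADD = Instr("ADD", (1, 2), 3)        # ADD(r1, r2, r3)
CMPLEC = Instr("CMPLEC", (3,), 0)    # CMPLEC(r3, 100, r0)
MULC = Instr("MULC", (3,), 4)        # MULC(r3, 100, r4)
SUB = Instr("SUB", (0, 4), 5)        # SUB(r0, r4, r5)

# When SUB is in RF, MULC is in ALU and CMPLEC is in WB.
sub_sources = [bypass_select(s, MULC, CMPLEC) for s in SUB.srcs]
print(sub_sources)   # r0 comes from WB, r4 comes from ALU
```

The same function answers the question for MULC in RF (CMPLEC in ALU, ADD in WB): `bypass_select(3, CMPLEC, ADD)` selects the WB bypass for r3.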
Structural Data Hazard
LD(r1, 0, r4)
ADD(r1, r4, r5)
XOR(r3, r4, r6)
       i     i+1   i+2   i+3   i+4   i+5
IF     LD    ADD   XOR
RF           LD    ADD   XOR
ALU                LD    ADD   XOR
WB                       LD    ADD   XOR
??? marks the ADD in the RF stage during cycle i+2, which needs r4 while the LD is still in the ALU stage.
The XOR hazard is pretty easy, but… How do we fix this one?
In a 4-stage pipeline, for a LD instruction fetched during clock i, the data isn’t returned from memory until late into cycle i+3. Bypassing can fix the XOR but not ADD!
Load Hazard (easy)
[Figure: pipelined Beta datapath with XOR(r3, r4, r6) in the RF stage and LD(r1, 0, r4) in the WB stage.]
The XOR operand r4 can simply be bypassed from the output of the memory in the WB stage to the RF stage… by our normal bypass path.
Load Hazard (hard)
[Figure: pipelined Beta datapath with ADD(r4, r1, r5) in the RF stage and LD(r1, 0, r4) in the ALU stage.]
The r4 operand to the ADD instruction hasn’t yet been fetched from memory. It exists NOWHERE on our data paths; we can’t solve this problem by bypassing!
Load Delay
Bypassing can’t fix the problem with ADD since the data simply isn’t available! We have to add some pipeline interlock hardware to stall ADD’s execution.
LD(r1, 0, r4)
ADD(r1, r4, r5)
XOR(r3, r4, r6)
       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     LD    ADD   XOR   XOR
RF           LD    ADD   ADD   XOR
ALU                LD    NOP   ADD   XOR
WB                       LD    NOP   ADD   XOR
If the compiler knows about a machine’s load delay, it can often rearrange code sequences to eliminate such hazards. Many compilers provide machine-specific instruction scheduling.
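The Load Delay behavior can be checked with a small simulation. The following is a minimal Python sketch of a 4-stage pipeline (IF, RF, ALU, WB) with a load interlock, assuming that all other hazards are covered by bypassing. The `Instr` record, the `simulate` function, and the independent `SUB(r7, r8, r9)` instruction used for the reordering experiment are hypothetical and do not come from the slides.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction (not part of the slides)."""
    op: str
    srcs: Tuple[int, ...]
    dest: Optional[int]


NOP = Instr("NOP", (), None)


def simulate(program: List[Instr]) -> List[List[str]]:
    """Return one [IF, RF, ALU, WB] row of opcode names per clock cycle."""
    pending = list(program)
    if_ = pending.pop(0) if pending else None
    rf = alu = wb = None
    rows = []
    while any(s is not None for s in (if_, rf, alu, wb)):
        rows.append([s.op if s else "" for s in (if_, rf, alu, wb)])
        # Interlock: the instruction in RF reads a register that the LD
        # currently in ALU has not fetched from memory yet.
        stall = (rf is not None and alu is not None
                 and alu.op == "LD" and alu.dest in rf.srcs)
        wb = alu
        if stall:
            alu = NOP                  # freeze IF and RF, bubble into ALU
        else:
            alu, rf = rf, if_
            if_ = pending.pop(0) if pending else None
    return rows


LD = Instr("LD", (1,), 4)        # LD(r1, 0, r4)
ADD = Instr("ADD", (1, 4), 5)    # ADD(r1, r4, r5)
XOR = Instr("XOR", (3, 4), 6)    # XOR(r3, r4, r6)
SUB = Instr("SUB", (7, 8), 9)    # hypothetical independent instruction

for row in simulate([LD, ADD, XOR]):
    print(row)

# Scheduling experiment: moving the independent SUB into the load delay slot.
print(len(simulate([LD, ADD, XOR, SUB])), "cycles with SUB last")
print(len(simulate([LD, SUB, ADD, XOR])), "cycles with SUB after LD")
```

The first loop prints the same seven cycles as the stall diagram. The last two lines show the compiler's opportunity: with the independent instruction placed directly after the LD, the ADD no longer stalls and the sequence finishes one cycle sooner.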
Stall Logic
[Figure: pipelined Beta datapath in which STALL controls load enables (LE) on the PC and the RF-stage pipeline registers, and an AnnulRF mux can insert NOP into the ALU stage; shown with ADD(r4, r1, r5) in the RF stage and LD(r1, 0, r4) in the ALU stage.]
(1) Freeze IF, RF stages
(2) Introduce NOP into ALU stage
(3) Wait until operand is available
Memory Timing & Pipelining
But, but, what about FASTER processors?
FACT: Processors have become very fast relative to memories! And this gap continues to grow…
Do we just increase the clock period to accommodate this bottleneck component?
ALTERNATIVE: Longer pipelines.
1. Add “MEMORY WAIT” stages between START of read operation & return of data.
2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once.
These steps add load delay slots; hence
3. Stall pipeline on unbypassable load delays.
A 4-Stage pipeline requires READ access in less than one clock.
A 5-Stage pipeline would allow nearly two clocks...
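The cost of those extra load delay slots can be estimated with a short calculation. The sketch below is an illustration under stated assumptions, not the course's definition: stage 0 is IF, stage 1 is RF (where operands are read), the loaded value can first be bypassed when the LD is in the last stage (WB), and `load_stalls` is a hypothetical helper name.

```python
def load_stalls(num_stages: int, distance: int) -> int:
    """Stall cycles for an instruction `distance` slots after a LD it depends on.

    distance == 1 means the instruction immediately following the LD.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    wb_index = num_stages - 1
    # With no stall, the LD sits in stage (1 + distance) while the
    # dependent instruction is in RF (stage 1).
    ld_stage_when_consumer_in_rf = 1 + distance
    return max(0, wb_index - ld_stage_when_consumer_in_rf)


for stages in (4, 5):
    print(stages, "stages:", [load_stalls(stages, d) for d in (1, 2, 3)])
```

Under these assumptions the 4-stage pipeline stalls one cycle only for the instruction right after the LD, while the 5-stage pipeline stalls two cycles for that instruction and one cycle for the next.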
5-stage pipeline
[Figure: 5-stage Beta datapath with Instruction Fetch, RF, ALU, Memory, and Write Back stages; the branch target is <PCRF>+4+C*4.]
• Omits some detail
• NO bypass or interlock logic
Memory timing: the address is available right after the instruction enters the Memory pipe stage, and the data is needed right before the rising clock edge at the end of the Write Back pipe stage, so a read has almost 2 clock cycles.
5-stage Pipeline
[Figure: the same 5-stage datapath with NOP muxes added on the instruction path.]
We wanted a simple, clean pipeline but…
• added IRIF mux to annul branch-slot instructions
• added A/B bypass muxes to get data before it’s written to regfile
• added LE/muxes to freeze IF/RF stage so we can wait for LD to reach WB stage
RF-stage Bypass Details
[Figure: RF-stage detail. The A Bypass and B Bypass muxes each choose among the Register File outputs (RD1, RD2) and the values arriving from ALU/MEM/WB. Other labels: JT, Z, ASEL with PC+4+4*SXT(C), BSEL with SXT(C), Select “0”, A and B to ALU, D to Mem.]
1, 2, 3, 4… Hum, it looks like we have a few extra inputs on that mux
We’ve been a little sloppy about this detail
Note: can use “distributed” mux built from tristate drivers.
Bypass Implementation
[Figure: bypass buses from ALU (YALU, PCALU), from MEM (YMEM), and from WB run back to the A and B inputs of the alu. Part 1 of the WDSEL mux is in the ALU stage; Part 2 of the WDSEL mux is in the WB stage, fed from mem.]
To reduce the amount of bypass logic, the WDSEL mux has been split: the choice between ALU and PC+4 is made in the ALU stage, and the choice between ALU/PC and MEMDATA is made in the WB stage.
Bypass Logic
Beta Bypass logic (need two copies for A/B data):
[Figure: 5-bit compares match Ra (or Rb/Rc) against Rc (ALU)*, Rc (MEM)*, Rc (WB)*, and “31”. The results select among ALU bypass, MEM bypass, WB bypass, Regfile (no bypass), and Select “0”.]
* If instruction is a ST (doesn’t write into regfile), set RC for ALU/MEM/WB to R31
Linkage Register Write Timing
| The code: Assume Reg[LP] = 100...
        ADD(r31, r31, LP)
        BR(f, LP)           | Reg[LP] = .+4
x:      SUB(LP, 1, LP)
        ...
f:      XOR(LP, r31, r0)
        OR(r31, LP, r1)
        ADD(r31, LP, r2)
       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   BR    SUB   XOR   OR    ADD
RF           ADD   BR    NOP   XOR   OR    ADD
ALU                ADD   BR    NOP   XOR   OR
MEM                      ADD   BR    NOP   XOR
WB                             ADD   BR    NOP
BR Decision Time: cycle i+2 (BR in RF). ADD writes in cycle i+4. BR writes in cycle i+5.
Can we make XOR’s regfile access work by bypassing?
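The Beta bypass logic summarized under Bypass Logic can be modeled in a few lines of Python. This is a sketch under assumptions rather than the actual control logic: `StageInfo`, `effective_rc`, and `read_operand` are hypothetical names, the youngest matching stage (ALU, then MEM, then WB) is assumed to take priority, every stage is assumed to have its result ready to bypass (loads that need a stall are ignored), and the link address `0x108` is an invented value.

```python
from typing import List, NamedTuple, Optional

R31 = 31  # always reads as zero on the Beta


class StageInfo(NamedTuple):
    """Hypothetical summary of the instruction in the ALU, MEM, or WB stage."""
    rc: int          # destination register field
    is_store: bool   # ST does not write the register file
    value: int       # value this stage would write (assumed available)


def effective_rc(stage: Optional[StageInfo]) -> int:
    """Destination used by the compare; ST or an empty stage counts as R31."""
    if stage is None or stage.is_store:
        return R31
    return stage.rc


def read_operand(reg: int, regfile: List[int],
                 alu: Optional[StageInfo],
                 mem: Optional[StageInfo],
                 wb: Optional[StageInfo]) -> int:
    """Value the RF stage should deliver for source register `reg`."""
    if reg == R31:
        return 0                            # select "0"
    for stage in (alu, mem, wb):            # youngest producer first (assumed)
        if effective_rc(stage) == reg:      # 5-bit compare
            return stage.value
    return regfile[reg]                     # no bypass


# Cycle i+4 of the linkage example: XOR(LP, r31, r0) is in RF,
# NOP is in ALU, BR(f, LP) is in MEM, ADD(r31, r31, LP) is in WB.
LP = 28
regfile = [0] * 32
regfile[LP] = 100                           # "Assume Reg[LP] = 100"
nop_stage = StageInfo(R31, False, 0)
br_stage = StageInfo(LP, False, 0x108)      # hypothetical .+4 of the BR
add_stage = StageInfo(LP, False, 0)         # ADD(r31, r31, LP) produces 0

print(hex(read_operand(LP, regfile, nop_stage, br_stage, add_stage)))
```

With the BR's link value available from the MEM stage, XOR sees the newest LP rather than the stale 100 in the register file or the 0 from the older ADD.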
[Figure: 5-stage datapath with XOR(r28…) in the RF stage and BR(…,r28) in the Memory stage; a BR/JMP PC bypass is marked.]
For BR/JMP, Rc value is taken from PC, not ALU.
So we have to add bypass paths for PCALU and PCMEM.
PCWB is already taken care of if we bypass WB stage from output of WDSEL mux.
Illegal Opcode Traps
| User program:
        ...
        BAD(...)            | Illegal instr.
r:      ...
| Operating System: UUO Handler
IllOp:  ST(r0,...)          | Save a reg,
        LD(xp,-4,r0)        | Fetch bad instr
        ...
        LD(...,r0)          | Restore regs,
        JMP(xp)             | Return to pgm.
IDEA: TRAP illegal instructions to a special routine in the Operating System, which can
• Interpret them in software; or
• Print humane error report.
IMPLEMENTATION: On Bad Opcode (discovered in RF Stage):
• Select IllOp adr as next PC
• Annul instruction in IF stage
• Substitute BNE(r31,0,XP) for bad instruction - will (eventually) store PC+4 into XP ... need bypass paths to make XP usable immediately by code at IllOp!
Unused Opcode Traps
[Figure: 5-stage datapath with the BAD instruction in the RF stage replaced by BNE(…,XP) through the AnnulRF mux, and NOP inserted through the AnnulIF mux.]
Bad opcode decoded in RF stage:
• PC <- address of IllOp handler
• Annul instruction in IF
• Force BNE(R31,0,XP) in RF stage; it will save PC+4 in XP when it reaches WB stage
Taking Exception…
In general, we’d like to annul ALL instructions following one that causes a trap or fault:
FREEZE state at time of exception, for inspection by handler code.
ILLEGAL INSTRUCTIONS are recognizable in RF stage of pipe; are ALL faults & traps?
CONSIDER:
ARITHMETIC EXCEPTIONS: divide by zero, etc.
• Caught by ALU subsystem, during processing of data in ALU stage
MEMORY FAULTS: Program reference to illegal memory location...
• Caught by MEMORY subsystem, during processing of address input in MEM stage
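The exception recipe from the slides above (select the handler as next PC, substitute BNE(R31,0,XP) for the offending instruction, annul everything behind it) can be sketched as a function on the pipeline contents. This is a simplified Python model, not the hardware: the pipeline is a plain dictionary of instruction names, `take_exception` is a hypothetical helper, and the handler address passed in the example is a placeholder value.

```python
from typing import Dict, Tuple

STAGES = ["IF", "RF", "ALU", "MEM", "WB"]   # youngest instruction first
NOP = "NOP"
SAVE_PC = "BNE(R31,0,XP)"

# Stage in which each kind of exception is recognized, per the slides.
DETECTED_IN = {
    "illegal opcode": "RF",
    "arithmetic exception": "ALU",
    "memory fault": "MEM",
}


def take_exception(pipeline: Dict[str, str], kind: str,
                   handler_pc: int) -> Tuple[Dict[str, str], int]:
    """Return (pipeline contents after the exception, next PC)."""
    stage = DETECTED_IN[kind]
    k = STAGES.index(stage)
    after = dict(pipeline)
    for younger in STAGES[:k]:
        after[younger] = NOP        # annul instructions earlier in the pipe
    after[stage] = SAVE_PC          # will save PC+4 in XP when it reaches WB
    return after, handler_pc        # older instructions are left alone


ILLOP_HANDLER = 0x4                 # placeholder address for illustration
before = {"IF": "I5", "RF": "BAD", "ALU": "I3", "MEM": "I2", "WB": "I1"}
after, next_pc = take_exception(before, "illegal opcode", ILLOP_HANDLER)
print(after, hex(next_pc))
```

Calling the same function with "memory fault" turns the IF, RF, and ALU slots into NOPs and places the BNE in MEM, which matches the idea of annulling all instructions that follow the faulting one.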
Annulment Logic
[Figure: 5-stage datapath with AnnulIF, AnnulRF, AnnulALU, and AnnulMEM muxes. AnnulIF can substitute NOP; AnnulRF, AnnulALU, and AnnulMEM can substitute NOP or BNE(R31,0,XP). A MEM FAULT signal is shown coming from the Memory stage.]
Fault in ??? stage:
• PC <- address of fault handler
• Force BNE(R31,0,XP) in ??? stage; it will save PC+4 in XP when it reaches WB stage
• Annul all following instructions (those earlier in the pipeline): called “flushing the pipe”
Asynchronous I/O Interrupts
Take, for example,
| The interrupted code:
        ...
        ADD(...)            | Interrupt Taken HERE
        SUB(...)
        MUL(...)
        XOR(...)
        ...
| The interrupt handler:
xh:     OR(...)
        ...
        JMP(xp)
Suppose key struck, interrupt requested (via IRQ) during the fetch of ADD. Then let’s
• Select XAdr (handler) as next PC
• Leave ADD in pipeline; NO annulment!
• Code handler to return to SUB instruction.
This should be easy.
Can this work??? Let’s find out…
Asynchronous Interrupt Timing
        ...
        ADD(...)            | Interrupt Taken HERE
        SUB(...)
        MUL(...)
        XOR(...)
        ...
xh:     OR(...)             | interrupt handler
        ...
        JMP(xp)
       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   OR    ...   ...   ...   ...   ...
RF           ADD   OR    ...   ...   ...   ...
ALU                ADD   OR    ...   ...   ...
MEM                      ADD   OR    ...   ...
WB                             ADD   OR    ...
Interrupt occurs during cycle i; PC = Xadr??
PROBLEM: When does old PC+4 get written to XP???
Making Interrupts Work
Alternative: When taking interrupt,
• ANNUL instruction in IF stage... BUT instead of changing it to a NOP, change it to BNE(r31,0,XP). This will cause PC+4 of annulled instruction to be written to XP!
• CODE HANDLER to return to Reg[XP]-4 (since the annulled instruction is never executed)
       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   OR    ...   ...   ...   ...   ...
RF           BNE   OR    ...   ...   ...   ...
ALU                BNE   OR    ...   ...   ...
MEM                      BNE   OR    ...   ...
WB                             BNE   OR    ...
Interrupt taken during cycle i.
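The address arithmetic behind this scheme is small enough to check directly. In the Python sketch below, the function names and the address `0x100` for the ADD are hypothetical; only the PC+4 and XP-4 relationships come from the slides.

```python
def xp_after_interrupt(pc_of_annulled: int) -> int:
    """XP value written by the BNE(r31,0,XP) that replaces the annulled
    instruction: the PC+4 of that instruction."""
    return pc_of_annulled + 4


def handler_return_address(xp: int) -> int:
    """Where the handler must return: Reg[XP]-4, the annulled instruction
    itself, since it was never executed."""
    return xp - 4


ADD_ADDR = 0x100                    # hypothetical address of ADD(...)
xp = xp_after_interrupt(ADD_ADDR)   # points at SUB(...)
print(hex(xp), hex(handler_return_address(xp)))
```

Returning through an unadjusted XP would resume at SUB and silently skip the ADD, which is why the handler has to subtract 4 before its JMP.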
“Smart” Interrupt Handler
| The interrupted code:
        ...
        ADD(...)            | Interrupt taken HERE, ADD instruction annulled
        SUB(...)
        MUL(...)
        XOR(...)
        ...
| The interrupt handler:
xh:     OR(...)
        ...
        SUBC(xp,4,xp)       | Adjust XP so we
        JMP(xp)             | return to annulled
                            | instruction (ADD)
5-stage Pipeline: Final Version
[Figure: final 5-stage datapath with STALL on the IF/RF stages, A and B bypasses from the ALU, Memory, and Write Back stages, and AnnulIF, AnnulRF, AnnulALU, and AnnulMEM muxes that can each substitute NOP or BNE(R31,0,XP).]
• Can bypass results from ALU, MEM and WB back to RF
• Can stall IF and RF stages while waiting for LD result to reach WB
• Can annul instruction at each stage
• Can force instruction to BNE(R31,0,XP) in each stage; it will save PC+4 in XP when it reaches WB stage
Pipeline Review
Simple unpipelined Beta:
• 1 cycle/instruction
• long cycle time: mem+regs+alu+mem
2-Stage pipeline:
• increased throughput (<2x)
• introduced branch delay slots
  Choice of executing or annulling inst. after branch
5-stage pipeline:
• increased throughput (3x???)
• branch delay slots
• delayed register writeback (3 cycles)
  Add bypass paths (10) to access correct value
• memory data available only in WB stage
  Introduce NOPs at IRALU, stall IF and RF stages until LD result ready
• handle RF/ALU/MEM stage exceptions
  Save PC+4 in XP (fake a BR), annul following insts. (those earlier in pipeline)
• implement interrupts
  Throw away IF inst., save PC+4 in XP, fix return
• extra HW due to pipelining
  Registers to hold values between stages
  Data bypass muxes in RF stage
  Inst. muxes for “rewriting” code to annul or save PC
RISC = Simplicity???
“The P.T. Barnum World’s Tallest Dwarf Competition”
World’s Most Complex RISC?
[Figure: diagram with the following labels.]
Primitive Machines: direct implementations
Generalization of registers and operand coding
Addressing features, e.g. index registers
Complex instructions, addressing modes
RISCs
Pipelines, Bypasses, Annulment, …, ...
?
VLIWs, Super-Scalars