2009SuCS61C-L10-jh-p..

advertisement
inst.eecs.berkeley.edu/~cs61c
CS61CL : Machine Structures
Lecture #9 – Single Cycle CPU Design
2009-07-22
Jeremy Huddleston
CS61CL L10 CPU II: Control & Pipeline (1)
Huddleston, Summer 2009 © UCB
Review: A Single Cycle Datapath
Instruction<31:0>
Rs Rt Rd Imm16
ALUctr
MemtoReg
Rd Rt
1
0
RegWr
5
Rs Rt
5
busB
32
16
Extender
clk
imm16
busA
ExtOp
CS61CL L10 CPU II: Control & Pipeline (2)
32
=
ALU
RegFile
32
MemWr
5
Rw Ra Rb
busW
<0:15>
RegDst
<11:15>
clk
<16:20>
instr
fetch
unit
nPC_sel
<21:25>
• We have
everything
except control
signals
0
32
1
32
Data In
clk
32
0
WrEn Adr
Data
Memory
1
ALUSrc
Huddleston, Summer 2009 © UCB
Recap: Meaning of the Control Signals
• nPC_sel:
“n”=next
“+4” 0  PC <– PC + 4
“br” 1  PC <– PC + 4 + {SignExt(Im16) , 00 }
• Later in lecture: higher-level connection
between mux and branch condition
Inst Address
00
0
PC
Mux
Adder
PC Ext
imm16
CS61CL L10 CPU II: Control & Pipeline (3)
nPC_sel
Adder
1
1
clk
Huddleston, Summer 2009 © UCB
Recap: Meaning of the Control Signals
° MemWr: 1  write memory
• ExtOp: “zero”, “sign”
° MemtoReg: 0  ALU; 1  Mem
• ALUsrc: 0  regB;
1  immed
° RegDst: 0  “rt”; 1  “rd”
• ALUctr: “ADD”, “SUB”, “OR” ° RegWr: 1  write register
MemtoReg
ALUctr
MemWr
RegDst Rd Rt
1
RegWr
0
Rs Rt
5
5
5
Rw Ra Rb
RegFile
32
busB
32
imm16
16
ExtOp
CS61CL L10 CPU II: Control & Pipeline (4)
Extender
clk
32
0
ALU
busW
busA
32
0
32 WrEn Adr
1
32
Data In
ALUSrc clk
Data
Memory
1
Huddleston, Summer 2009 © UCB
Instruction Fetch Unit at the Beginning of Add
• Fetch the instruction from Instruction
memory: Instruction = MEM[PC]
• same for
all instructions
Inst
Memory
nPC_sel
Inst Address
00
Adder
1
Instruction<31:0>
PC
Mux
Adder
PC Ext
clk
imm16
CS61CL L10 CPU II: Control & Pipeline (5)
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Add
31
26
op
21
16
rs
11
rt
rd
6
shamt
funct
R[rd] = R[rs] + R[rt]
5
5
5
busA
Rw Ra Rb
RegFile
32
busB
32
imm16
16
Extender
clk
ExtOp=x
CS61CL L10 CPU II: Control & Pipeline (6)
32
Rs Rt Rd Imm16
ALUctr=ADD
MemtoReg=0
MemWr=0
=
ALU
busW
0
32
1
32
<0:15>
Rs Rt
<11:15>
RegWr=1
<16:20>
0
<21:25>
Rd Rt
1
Instruction<31:0>
instr
fetch
unit
nPC_sel=+4
RegDst=1
clk
0
Data In
ALUSrc=0
clk
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
Instruction Fetch Unit at the End of Add
• PC = PC + 4
• This is the same for all instructions except:
Branch and Jump
Inst
Memory
nPC_sel=+4 Inst Address
00
Adder
1
PC
Mux
Adder
PC Ext
clk
imm16
CS61CL L10 CPU II: Control & Pipeline (7)
Huddleston, Summer 2009 © UCB
Single Cycle Datapath during Or Immediate?
31
26
21
op
16
rs
0
rt
immediate
• R[rt] = R[rs] OR ZeroExt[Imm16]
Rs Rt
5
5
5
Rw Ra Rb
RegFile
32
busA
busB
32
imm16
16
ExtOp=
CS61CL L10 CPU II: Control & Pipeline (8)
Extender
clk
32
=
Rt Rd Imm16
ALUctr=
MemtoReg=
MemWr=
ALU
busW
0
32
1
32
Data In
ALUSrc=
clk
<0:15>
RegWr=
Rs
0
<11:15>
1
clk
<16:20>
Rd Rt
<21:25>
nPC_sel=
RegDst=
Instruction<31:0>
instr
fetch
unit
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
Single Cycle Datapath during Or Immediate?
31
26
21
op
16
rs
0
rt
immediate
• R[rt] = R[rs] OR ZeroExt[Imm16]
Rs
0
RegWr=1
5
Rs Rt
5
5
busA
Rw Ra Rb
RegFile
32
32
16
Extender
clk
imm16
busB
ExtOp=zero
CS61CL L10 CPU II: Control & Pipeline (9)
Rt Rd Imm16
ALUctr=OR
MemtoReg=0
MemWr=0
=
ALU
busW
32
0
32
1
32
Data In
ALUSrc=1
<0:15>
1
<11:15>
clk
Rd Rt
<16:20>
RegDst=0
instr
fetch
unit
<21:25>
nPC_sel=+4
Instruction<31:0>
clk
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Load?
31
26
21
op
16
rs
0
rt
immediate
• R[rt] = Data Memory {R[rs] + SignExt[imm16]}
Rs Rt
5
5
5
busA
Rw Ra Rb
RegFile
32
busB
32
imm16
16
Extender
clk
ExtOp=
CS61CL L10 CPU II: Control & Pipeline (10)
32
=
Rt Rd Imm16
ALUctr=
MemtoReg=
MemWr=
ALU
busW
0
32
1
32
Data In
ALUSrc=
clk
<0:15>
RegWr=
Rs
0
<11:15>
1
clk
<16:20>
Rd Rt
<21:25>
nPC_sel=
RegDst=
Instruction<31:0>
instr
fetch
unit
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Load
31
26
21
op
16
rs
0
rt
immediate
• R[rt] = Data Memory {R[rs] + SignExt[imm16]}
5
Rs Rt
5
5
busA
Rw Ra Rb
RegFile
32
32
16
Extender
clk
imm16
busB
ExtOp=sign
CS61CL L10 CPU II: Control & Pipeline (11)
32
Rs Rt Rd Imm16
ALUctr=ADD
MemtoReg=1
MemWr=0
=
ALU
busW
0
32
1
32
<0:15>
RegWr=1
<11:15>
0
<16:20>
Rd Rt
<21:25>
nPC_sel=+4
RegDst=0
clk
1
Instruction<31:0>
instr
fetch
unit
Data In
ALUSrc=1
clk
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Store?
31
26
21
op
16
rs
0
rt
immediate
• Data Memory {R[rs] + SignExt[imm16]} = R[rt]
Rs Rt
5
5
5
busA
Rw Ra Rb
RegFile
32
busB
32
imm16
16
Extender
clk
ExtOp=
CS61CL L10 CPU II: Control & Pipeline (12)
32
=
Rt Rd Imm16
ALUctr=
MemtoReg=
MemWr=
ALU
busW
0
32
1
32
Data In
ALUSrc=
clk
<0:15>
RegWr=
Rs
0
<11:15>
1
clk
<16:20>
Rd Rt
<21:25>
nPC_sel=
RegDst=
Instruction<31:0>
instr
fetch
unit
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Store
31
26
21
op
16
rs
0
rt
immediate
• Data Memory {R[rs] + SignExt[imm16]} = R[rt]
5
Rs Rt
5
5
busA
Rw Ra Rb
RegFile
32
32
16
Extender
clk
imm16
busB
ExtOp=sign
CS61CL L10 CPU II: Control & Pipeline (13)
32
Rs Rt Rd Imm16
ALUctr=ADD
MemtoReg=x
MemWr=1
=
ALU
busW
0
32
1
32
<0:15>
RegWr=0
<11:15>
0
<16:20>
Rd Rt
<21:25>
nPC_sel=+4
RegDst=x
clk
1
Instruction<31:0>
instr
fetch
unit
Data In
ALUSrc=1
clk
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Branch?
31
26
21
op
16
rs
0
rt
immediate
• if (R[rs] - R[rt] == 0) then Zero = 1 ; else Zero = 0
Rs Rt
5
5
5
busA
Rw Ra Rb
RegFile
32
busB
32
imm16
16
Extender
clk
ExtOp=
CS61CL L10 CPU II: Control & Pipeline (14)
32
=
Rt Rd Imm16
ALUctr=
MemtoReg=
MemWr=
ALU
busW
0
32
1
32
Data In
ALUSrc=
clk
<0:15>
RegWr=
Rs
0
<11:15>
1
clk
<16:20>
Rd Rt
<21:25>
nPC_sel=
RegDst=
Instruction<31:0>
instr
fetch
unit
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Branch
31
26
21
op
16
rs
0
rt
immediate
• if (R[rs] - R[rt] == 0) then “=“ = 1 ; else “=“ = 0
Rs Rt
5
5
5
busA
Rw Ra Rb
RegFile
32
busB
32
imm16
16
Extender
clk
ExtOp=x
CS61CL L10 CPU II: Control & Pipeline (15)
32
=
Rt Rd Imm16
ALUctr=x
MemtoReg=x
MemWr=0
ALU
busW
0
32
1
32
Data In
ALUSrc=0
<0:15>
RegWr=0
<11:15>
Rs
0
<16:20>
Rd Rt
<21:25>
nPC_sel=br
RegDst=x
clk
1
Instruction<31:0>
instr
fetch
unit
clk
32
0
WrEn Adr
Data
Memory
1
Huddleston, Summer 2009 © UCB
Instruction Fetch Unit at the End of Branch
31
26
op
21
rs
16
rt
0
immediate
• if (Equals == 1) then PC = PC + 4 +
SignExt[imm16]*4 ; else PC = PC + 4
Inst
Memory
nPC_sel
Adr
Equal
MUX
ctrl
nPC_sel
• What is encoding of
nPC_sel?
• Direct MUX select?
• Branch inst. / not branch
0
00
• Let’s pick 2nd option
Mux
PC
Adder
PC Ext
imm16
Adder
1
Instruction<31:0>
1
clk
CS61CL L10 CPU II: Control & Pipeline (16)
nPC_sel
0
1
1
zero?
x
0
1
MUX
0
0
1
Q: What
logic gate?
Huddleston, Summer 2009 © UCB
How to Design a Processor: step-by-step
• 1. Analyze instruction set architecture (ISA)
 datapath requirements
• meaning of each instruction is given by the register transfers
• datapath must include storage element for ISA registers
• datapath must support each register transfer
• 2. Select set of datapath components and establish
clocking methodology
• 3. Assemble datapath meeting requirements
• 4. Analyze implementation of each instruction to
determine setting of control points that effects the
register transfer.
• 5. Assemble the control logic (hard part!)
CS61CL L10 CPU II: Control & Pipeline (17)
Huddleston, Summer 2009 © UCB
Step 4: Given Datapath: RTL  Control
Instruction<31:0>
Rd
<0:15>
Rs
<11:15>
Rt
<16:20>
Op Fun
<21:25>
<0:5>
Adr
<26:31>
Inst
Memory
Imm16
Control
nPC_sel RegWr RegDst ExtOp ALUSrc ALUctr MemWr MemtoReg
DATA PATH
CS61CL L10 CPU II: Control & Pipeline (18)
Huddleston, Summer 2009 © UCB
A Summary of the Control Signals
See
Appendix A
func 10 0000 10 0010
We Don’t Care :-)
op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
add
sub
ori
lw
sw
beq
jump
RegDst
1
1
0
0
x
x
x
ALUSrc
0
0
1
1
1
0
x
MemtoReg
0
0
0
1
x
x
x
RegWrite
1
1
1
1
0
0
0
MemWrite
0
0
0
0
1
0
0
nPCsel
0
0
0
0
0
1
?
Jump
0
0
0
0
0
0
1
ExtOp
x
x
0
1
1
x
x
Add
Subtract
Or
Add
Add
Subtract
x
ALUctr<2:0>
31
26
21
16
R-type
op
rs
rt
I-type
op
rs
rt
J-type
op
CS61CL L10 CPU II: Control & Pipeline (19)
11
rd
6
shamt
immediate
target address
0
funct
add, sub
ori, lw, sw, beq
jump
Huddleston, Summer 2009 © UCB
Boolean Expressions for Controller
RegDst
ALUSrc
MemtoReg
RegWrite
MemWrite
nPCsel
Jump
ExtOp
ALUctr[0]
ALUctr[1]
= add + sub
= ori + lw + sw
= lw
= add + sub + ori + lw
= sw
= beq
= jump
= lw + sw
= sub + beq (assume ALUctr is 0 ADD, 01: SUB, 10: OR)
= or
where,
rtype = ~op5  ~op4  ~op3  ~op2 
ori = ~op5  ~op4  op3  op2 
lw = op5  ~op4  ~op3  ~op2 
sw
= op5  ~op4  op3  ~op2 
beq = ~op5  ~op4  ~op3  op2 
jump = ~op5  ~op4  ~op3  ~op2 
~op1  ~op0,
~op1  op0
op1  op0
op1  op0
~op1  ~op0
op1  ~op0
How do we
implement this in
gates?
add = rtype  func5  ~func4  ~func3  ~func2  ~func1  ~func0
sub = rtype  func5  ~func4  ~func3  ~func2  func1  ~func0
CS61CL L10 CPU II: Control & Pipeline (20)
Huddleston, Summer 2009 © UCB
Controller Implementation
opcode
func
“AND” logic
CS61CL L10 CPU II: Control & Pipeline (21)
add
sub
ori
lw
sw
beq
jump
“OR” logic
RegDst
ALUSrc
MemtoReg
RegWrite
MemWrite
nPCsel
Jump
ExtOp
ALUctr[0]
ALUctr[1]
Huddleston, Summer 2009 © UCB
Processor Performance
• Can we estimate the clock rate (frequency) of
our single-cycle processor? We know:
• 1 cycle per instruction
•lw is the most demanding instruction.
• Assume these delays for major pieces of the
datapath:
-
Instr. Mem, ALU, Data Mem : 2 ns each, regfile 1 ns
Instruction execution requires: 2 + 1 + 2 + 2 + 1 = 8 ns
  125 MHz
• What can we do to improve clock rate?
• Will this improve performance as well?
• We want increases in clock rate to result in
programs executing quicker.
CS61CL L10 CPU II: Control & Pipeline (22)
Huddleston, Summer 2009 © UCB
Gotta Do Laundry
• Ann, Brian, Cathy, Dave
each have one load of
clothes to wash, dry, fold,
and put away
A B C D
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put
clothes into drawers
CS61CL L10 CPU II: Control & Pipeline (23)
Huddleston, Summer 2009 © UCB
Sequential Laundry
6 PM 7
T
a
s
k
O
r
d
e
r
A
8
9
10
11
12
1
2 AM
30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
Time
B
C
D
• Sequential laundry takes
8 hours for 4 loads
CS61CL L10 CPU II: Control & Pipeline (24)
Huddleston, Summer 2009 © UCB
Pipelined Laundry
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
10
11
12
1
2 AM
Time
A
B
C
O
D
r
d
e
• Pipelined
r
laundry takes
3.5 hours for 4 loads!
CS61CL L10 CPU II: Control & Pipeline (25)
Huddleston, Summer 2009 © UCB
General Definitions
• Latency: time to completely execute a
certain task
• for example, time to read a sector from
disk is disk access time or disk latency
• Throughput: amount of work that can
be done over a period of time
CS61CL L10 CPU II: Control & Pipeline (26)
Huddleston, Summer 2009 © UCB
Pipelining Lessons (1/2)
6 PM
T
a
s
k
7
8
Time
30 30 30 30 30 30 30
A
B
O
r
d
e
r
9
C
D
CS61CL L10 CPU II: Control & Pipeline (27)
• Pipelining doesn’t help
latency of single task, it
helps throughput of entire
workload
• Multiple tasks operating
simultaneously using
different resources
• Potential speedup =
Number pipe stages
• Time to “fill” pipeline and
time to “drain” it reduces
speedup:
2.3X v. 4X in this example
Huddleston, Summer 2009 © UCB
Pipelining Lessons (2/2)
6 PM
T
a
s
k
7
8
Time
30 30 30 30 30 30 30
A
B
O
r
d
e
r
9
C
D
CS61CL L10 CPU II: Control & Pipeline (28)
• Suppose new
Washer takes 20
minutes, new
Stasher takes 20
minutes. How much
faster is pipeline?
• Pipeline rate limited
by slowest pipeline
stage
• Unbalanced lengths
of pipe stages
reduces speedup
Huddleston, Summer 2009 © UCB
Steps in Executing MIPS
1) IFtch: Instruction Fetch, Increment PC
2) Dcd: Instruction Decode, Read
Registers
3) Exec:
Mem-ref: Calculate Address
Arith-log: Perform Operation
4) Mem:
Load: Read Data from Memory
Store: Write Data to Memory
5) WB: Write Data Back to Register
CS61CL L10 CPU II: Control & Pipeline (29)
Huddleston, Summer 2009 © UCB
Pipelined Execution Representation
Time
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
IFtch Dcd Exec Mem WB
• Every instruction must take same number
of steps, also called pipeline “stages”, so
some will go idle sometimes
CS61CL L10 CPU II: Control & Pipeline (30)
Huddleston, Summer 2009 © UCB
+4
1. Instruction
Fetch
ALU
Data
memory
rd
rs
rt
registers
PC
instruction
memory
Review: Datapath for MIPS
imm
5. Write
2. Decode/
3. Execute 4. Memory
Back
Register Read
• Use datapath figure to represent pipeline
IFtch Dcd Exec Mem WB
CS61CL L10 CPU II: Control & Pipeline (31)
Reg
ALU
I$
D$
Reg
Huddleston, Summer 2009 © UCB
Graphical Pipeline Representation
ALU
(In Reg, right half highlight read, left half write)
Time (clock cycles)
I
n
I$
D$
Reg
Reg
Load
s
I$
D$
Reg
Reg
t Add
r.
I$
D$
Reg
Reg
Store
O
I$
D$
Reg
Reg
Sub
r
I$
D$
Reg
Reg
d Or
e
r
ALU
ALU
ALU
ALU
CS61CL L10 CPU II: Control & Pipeline (32)
Huddleston, Summer 2009 © UCB
Example
• Suppose 2 ns for memory access, 2 ns for
ALU operation, and 1 ns for register file
read or write; compute instruction rate
• Nonpipelined Execution:
•lw : IF + Read Reg + ALU + Memory + Write Reg
= 2 + 1 + 2 + 2 + 1 = 8 ns
•add: IF + Read Reg + ALU + Write Reg
= 2 + 1 + 2 + 1 = 6 ns
(recall 8ns for single-cycle processor)
• Pipelined Execution:
• Max(IF,Read Reg,ALU,Memory,Write Reg) = 2
ns
CS61CL L10 CPU II: Control & Pipeline (33)
Huddleston, Summer 2009 © UCB
Pipeline Hazard: Matching socks in later loa
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
CS61CL L10 CPU II: Control & Pipeline (34)
Huddleston, Summer 2009 © UCB
Administrivia
• Midterm Solutions
• Regrade Requests
• HW7 (Design Document)
CS61CL L10 CPU II: Control & Pipeline (35)
Huddleston, Summer 2009 © UCB
Problems for Pipelining CPUs
• Limits to pipelining: Hazards prevent next
instruction from executing during its
designated clock cycle
• Structural hazards: HW cannot support some
combination of instructions (single person to
fold and put clothes away)
• Control hazards: Pipelining of branches causes
later instruction fetches to wait for the result of
the branch
• Data hazards: Instruction depends on result of
prior instruction still in the pipeline (missing
sock)
• These might result in pipeline stalls or
“bubbles” in the pipeline.
CS61CL L10 CPU II: Control & Pipeline (36)
Huddleston, Summer 2009 © UCB
Structural Hazard #1: Single Memory (1/2)
Time (clock cycles)
ALU
I
n
I$
D$
Reg
Reg
s Load
I$
D$
Reg
Reg
t Instr 1
r.
I$
D$
Reg
Reg
Instr 2
O
I$
D$
Reg
Reg
Instr 3
r
I$
D$
Reg
Reg
d Instr 4
e
rRead same memory twice in same clock cycle
ALU
ALU
ALU
ALU
CS61CL L10 CPU II: Control & Pipeline (37)
Huddleston, Summer 2009 © UCB
Structural Hazard #1: Single Memory (2/2)
• Solution:
• infeasible and inefficient to create
second memory
• (We’ll learn about this more next week)
• so simulate this by having two Level 1
Caches (a temporary smaller [of usually
most recently used] copy of memory)
• have both an L1 Instruction Cache and
an L1 Data Cache
• need more complex hardware to control
when both caches miss
CS61CL L10 CPU II: Control & Pipeline (38)
Huddleston, Summer 2009 © UCB
Structural Hazard #2: Registers (1/2)
Time (clock cycles)
I$
D$
Reg
Reg
D$
Reg
I$
Reg
ALU
Reg
ALU
I$
ALU
I
n
s
t sw
r. Instr 1
D$
Reg
ALU
O Instr 2
r
I$
D$
Reg
Reg
Instr 3
d
I$
D$
Reg
Reg
e Instr 4
r
Can we read and write to registers simultaneously?
ALU
CS61CL L10 CPU II: Control & Pipeline (39)
Huddleston, Summer 2009 © UCB
Structural Hazard #2: Registers (2/2)
• Two different solutions have been used:
1) RegFile access is VERY fast: takes less
than half the time of ALU stage
- Write to Registers during first half of each
clock cycle
- Read from Registers during second half of
each clock cycle
2) Build RegFile with independent read and
write ports
• Result: can perform Read and Write
during same clock cycle
CS61CL L10 CPU II: Control & Pipeline (40)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (1/8)
Time (clock cycles)
ALU
I
n
I$
D$
Reg
Reg
beq
s
I$
D$
Reg
Reg
t Instr 1
r.
I$
D$
Reg
Reg
Instr 2
O
I$
D$
Reg
Reg
Instr 3
r
I$
D$
Reg
Reg
d Instr 4
e
r
Where do we do the compare for the branch?
ALU
ALU
ALU
ALU
CS61CL L10 CPU II: Control & Pipeline (41)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (2/8)
• We had put branch decision-making
hardware in ALU stage
• therefore two more instructions after the
branch will always be fetched, whether or
not the branch is taken
• Desired functionality of a branch
• if we do not take the branch, don’t waste
any time and continue executing
normally
• if we take the branch, don’t execute any
instructions after the branch, just go to
the desired label
CS61CL L10 CPU II: Control & Pipeline (42)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (3/8)
• Initial Solution: Stall until decision is
made
• insert “no-op” instructions (those that
accomplish nothing, just take time) or
hold up the fetch of the next instruction
(for 2 cycles).
• Drawback: branches take 3 clock cycles
each (assuming comparator is put in
ALU stage)
CS61CL L10 CPU II: Control & Pipeline (43)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (4/8)
• Optimization #1:
• insert special branch comparator in
Stage 2
• as soon as instruction is decoded
(Opcode identifies it as a branch),
immediately make a decision and set the
new value of the PC
• Benefit: since branch is complete in
Stage 2, only one unnecessary
instruction is fetched, so only one no-op
is needed
• Side Note: This means that branches are
idle in Stages 3, 4 and 5.
CS61CL L10 CPU II: Control & Pipeline (44)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (5/8)
Time (clock cycles)
ALU
I
n
I$
D$
Reg
Reg
beq
s
I$
D$
Reg
Reg
t Instr 1
r.
I$
D$
Reg
Reg
Instr 2
O
I$
D$
Reg
Reg
Instr 3
r
I$
D$
Reg
Reg
d Instr 4
e
r
Branch comparator moved to Decode stage.
ALU
ALU
ALU
ALU
CS61CL L10 CPU II: Control & Pipeline (45)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (6a/8)
ALU
I • User inserting no-op instruction
n
Time (clock cycles)
s
I$
D$
Reg
Reg
add
t
r.
I$
D$
Reg
Reg
beq
ALU
ALU
bub bub bub bub bub
O nop
ble
ble
ble
ble
ble
r
D$
Reg
Reg
I$
d lw
e
Impact: 2 clock cycles per branch instruction
r
slow
CS61CL L10 CPU II: Control & Pipeline (46)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (6b/8)
ALU
I • Controller inserting a single bubble
n
Time (clock cycles)
s
I$
D$
Reg
Reg
add
t
r.
I$
D$
Reg
Reg
beq
ALU
ALU
bub I$
O lw
D$
Reg
Reg
ble
r
d
e
Impact: 2 clock cycles per branch instruction
r
slow
CS61CL L10 CPU II: Control & Pipeline (47)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (7/8)
• Optimization #2: Redefine branches
• Old definition: if we take the branch,
none of the instructions after the branch
get executed by accident
• New definition: whether or not we take
the branch, the single instruction
immediately following the branch gets
executed (called the branch-delay slot)
• The term “Delayed Branch” means
we always execute inst after branch
• This optimization is used with MIPS
CS61CL L10 CPU II: Control & Pipeline (48)
Huddleston, Summer 2009 © UCB
Control Hazard: Branching (8/8)
• Notes on Branch-Delay Slot
• Worst-Case Scenario: can always put a
no-op in the branch-delay slot
• Better Case: can find an instruction
preceding the branch which can be
placed in the branch-delay slot without
affecting flow of the program
- re-ordering instructions is a common
method of speeding up programs
- compiler/assembler must be very smart in
order to find instructions to do this
- usually can find such an instruction at least
50% of the time
- Jumps also have a delay slot…
CS61CL L10 CPU II: Control & Pipeline (49)
Huddleston, Summer 2009 © UCB
Example: Nondelayed vs. Delayed Bran
MAL
TAL
Nondelayed Branch
or $8, $9 ,$10
Delayed Branch
add $1 ,$2,$3
add $1 ,$2,$3
sub $4, $5,$6
sub $4, $5,$6
beq $1, $4, Exit
beq $1, $4, Exit
or
xor $10, $1,$11
xor $10, $1,$11
Exit:
CS61CL L10 CPU II: Control & Pipeline (50)
$8, $9 ,$10
Exit:
Huddleston, Summer 2009 © UCB
Data Hazards (1/2)
• Consider the following sequence of
instructions
add $t0, $t1, $t2
sub $t4, $t0 ,$t3
and $t5, $t0 ,$t6
or
$t7, $t0 ,$t8
xor $t9, $t0 ,$t10
CS61CL L10 CPU II: Control & Pipeline (51)
Huddleston, Summer 2009 © UCB
Data Hazards (2/2)
• Data-flow backward in time are
hazards
Reg
D$
Reg
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
ALU
CS61CL L10 CPU II: Control & Pipeline (52)
D$
ALU
r
or $t7,$t0,$t8
d
e xor $t9,$t0,$t10
r
I$
WB
ALU
O and $t5,$t0,$t6
MEM
ALU
ALU
Time (clock cycles)
I
n
IF
ID/RF EX
s add $t0,$t1,$t2 I$ Reg
t
I$
Reg
r. sub $t4,$t0,$t3
D$
Reg
Huddleston, Summer 2009 © UCB
Data Hazard Solution: Forwarding
• Forward result from one stage to another
and $t5,$t0,$t6
or $t7,$t0,$t8
xor $t9,$t0,$t10
Reg
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
ALU
I$
sub $t4,$t0,$t3
D$
ALU
Reg
WB
ALU
I$
EX MEM
ALU
ID/RF
ALU
add $t0,$t1,$t2
IF
D$
Reg
“or” hazard solved by register hardware
CS61CL L10 CPU II: Control & Pipeline (53)
Huddleston, Summer 2009 © UCB
Data Hazard: Loads (1/4)
• Dataflow backwards in time are hazards
I$
Reg
I$
sub $t3,$t0,$t2
EX
MEM
WB
D$
Reg
Reg
ALU
ID/RF
ALU
lw $t0,0($t1)
IF
D$
Reg
• Can’t solve all cases with forwarding
• Must stall instruction dependent on load,
then forward (more hardware)
CS61CL L10 CPU II: Control & Pipeline (54)
Huddleston, Summer 2009 © UCB
Data Hazard: Loads (2/4)
• Hardware stalls pipeline
• Called “interlock”
I$
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6
CS61CL L10 CPU II: Control & Pipeline (55)
WB
D$
Reg
Reg
bub
ble
D$
Reg
D$
Reg
ALU
Reg
MEM
ALU
I$
EX
ALU
ID/RF
ALU
lw $t0, 0($t1)
IF
I$
bub
ble
bub I$
ble
Reg
Reg
D$
Huddleston, Summer 2009 © UCB
Data Hazard: Loads (3/4)
• Instruction slot after a load is called
“load delay slot”
• If that instruction uses the result of the
load, then the hardware interlock will
stall it for one cycle.
• If the compiler puts an unrelated
instruction in that slot, then no stall
• Letting the hardware stall the instruction
in the delay slot is equivalent to putting
a nop in the slot (except the latter uses
more code space)
CS61CL L10 CPU II: Control & Pipeline (56)
Huddleston, Summer 2009 © UCB
Data Hazard: Loads (4/4)
• Stall is equivalent to nop
and $t5,$t0,$t4
or $t7,$t0,$t6
CS61CL L10 CPU II: Control & Pipeline (57)
I$
Reg
bub
ble
bub
ble
bub
ble
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
ALU
sub $t3,$t0,$t2
bub bub
ble ble
D$
ALU
nop
Reg
ALU
I$
ALU
lw $t0, 0($t1)
D$
Huddleston, Summer 2009 © UCB
Summary: Single-cycle Processor
°5 steps to design a processor
• 1. Analyze instruction set  datapath requirements
• 2. Select set of datapath components & establish clock
methodology
• 3. Assemble datapath meeting the requirements
• 4. Analyze implementation of each instruction to
determine setting of control points that effects the
register transfer.
• 5. Assemble the control logic
Processor
• Formulate Logic Equations
• Design Circuits
Control
Memory
Datapath
CS61CL L10 CPU II: Control & Pipeline (58)
Input
Output
Huddleston, Summer 2009 © UCB
Things to Remember
• Optimal Pipeline
• Each stage is executing part of an
instruction each clock cycle.
• One instruction finishes during each clock
cycle.
• On average, execute far more quickly.
• What makes this work?
• Similarities between instructions allow us to
use same stages for all instructions
(generally).
• Each stage takes about the same amount of
time as all others: little wasted time.
CS61CL L10 CPU II: Control & Pipeline (59)
Huddleston, Summer 2009 © UCB
“And in Conclusion..”
• Pipeline challenge is hazards
• Forwarding helps w/many data hazards
• Delayed branch helps with control
hazard in 5 stage pipeline
• Load delay slot / interlock necessary
• More aggressive performance:
• Superscalar
• Out-of-order execution
CS61CL L10 CPU II: Control & Pipeline (60)
Huddleston, Summer 2009 © UCB
Bonus slides
• These are extra slides that used to be
included in lecture notes, but have
been moved to this, the “bonus” area
to serve as a supplement.
• The slides will appear in the order they
would have in the normal presentation
CS61CL L10 CPU II: Control & Pipeline (61)
Huddleston, Summer 2009 © UCB
RTL: The Add Instruction
31
26
op
6 bits
21
rs
5 bits
16
rt
5 bits
11
6
0
rd
shamt
funct
5 bits
5 bits
6 bits
add rd, rs, rt
• MEM[PC]
Fetch the instruction
from memory
• R[rd] = R[rs] + R[rt] The actual operation
• PC = PC + 4
Calculate the next
instruction’s address
CS61CL L10 CPU II: Control & Pipeline (62)
Huddleston, Summer 2009 © UCB
A Summary of the Control Signals (1/2)
inst
add
Register Transfer
R[rd]  R[rs] + R[rt];
PC  PC + 4
ALUsrc = RegB, ALUctr = “ADD”, RegDst = rd, RegWr, nPC_sel = “+4”
sub
R[rd]  R[rs] – R[rt];
PC  PC + 4
ALUsrc = RegB, ALUctr = “SUB”, RegDst = rd, RegWr, nPC_sel = “+4”
ori
R[rt]  R[rs] + zero_ext(Imm16);
PC  PC + 4
ALUsrc = Im, Extop = “Z”,ALUctr = “OR”, RegDst = rt,RegWr, nPC_sel =“+4”
lw
R[rt]  MEM[ R[rs] + sign_ext(Imm16)]; PC  PC + 4
ALUsrc = Im, Extop = “sn”, ALUctr = “ADD”,
RegDst = rt, RegWr,
sw
MemtoReg,
nPC_sel = “+4”
MEM[ R[rs] + sign_ext(Imm16)]  R[rs]; PC  PC + 4
ALUsrc = Im, Extop = “sn”, ALUctr = “ADD”, MemWr, nPC_sel = “+4”
beq
if ( R[rs] == R[rt] ) then PC  PC + sign_ext(Imm16)] || 00 else PC  PC + 4
nPC_sel = “br”, ALUctr = “SUB”
CS61CL L10 CPU II: Control & Pipeline (63)
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Jump
31
J-type
26 25
0
op
target address
jump
• New PC = { PC[31..28], target address, 00 }
Instruction<31:0>
Jump=
<0:25>
Data In32
ALUSrc =
0
32
Clk
WrEn Adr
32
Mux
32
<0:15>
1
<11:15>
16
Extender
imm16
Rs Rd Imm16 TA26
MemtoReg =
Zero MemWr =
ALU
busA
Rw Ra Rb
32
32 32-bit
Registers busB
0
32
<16:20>
5
Rt
ALUctr =
Rs Rt
5
5
Mux
32
Clk
Clk
1 Mux 0
RegWr =
busW
Rt
<21:25>
RegDst =
Rd
Instruction
Fetch Unit
nPC_sel=
1
Data
Memory
ExtOp =
CS61CL L10 CPU II: Control & Pipeline (64)
Huddleston, Summer 2009 © UCB
The Single Cycle Datapath during Jump
31
J-type
26 25
0
op
target address
jump
• New PC = { PC[31..28], target address, 00 }
Instruction<31:0>
Jump=1
<0:25>
Data In32
ALUSrc = x
0
32
Clk
WrEn Adr
32
Mux
32
<0:15>
1
<11:15>
16
Extender
imm16
Rs Rd Imm16 TA26
MemtoReg = x
Zero MemWr = 0
ALU
busA
Rw Ra Rb
32
32 32-bit
Registers busB
0
32
<16:20>
5
Rt
ALUctr =x
Rs Rt
5
5
Mux
32
Clk
Clk
1 Mux 0
RegWr = 0
busW
Rt
<21:25>
RegDst = x
Rd
Instruction
Fetch Unit
nPC_sel=?
1
Data
Memory
ExtOp = x
CS61CL L10 CPU II: Control & Pipeline (65)
Huddleston, Summer 2009 © UCB
Instruction Fetch Unit at the End of Jump
31
26 25
J-type
0
op
target address
jump
• New PC = { PC[31..28], target address, 00 }
Jump
Inst
Memory
nPC_sel
Instruction<31:0>
Adr
Zero
nPC_MUX_sel
Adder
0
PC
Mux
Adder
imm16
00
1
How do we modify this
to account for jumps?
1
Clk
CS61CL L10 CPU II: Control & Pipeline (66)
Huddleston, Summer 2009 © UCB
Instruction Fetch Unit at the End of Jump
31
26 25
J-type
0
op
target address
jump
• New PC = { PC[31..28], target address, 00 }
Jump
Inst
Memory
nPC_sel
Instruction<31:0>
Adr
Zero
imm16
Mux
Adder
1
CS61CL L10 CPU II: Control & Pipeline (67)
00
TA
4 (MSBs)
1
PC
Adder
0
26
Mux
1
00
nPC_MUX_sel
0
Clk
Query
• Can Zero still
get asserted?
• Does nPC_sel
need to be 0?
• If not, what?
Huddleston, Summer 2009 © UCB
Historical Trivia
• First MIPS design did not interlock and
stall on load-use data hazard
• Real reason for name behind MIPS:
Microprocessor without
Interlocked
Pipeline
Stages
• Word Play on acronym for
Millions of Instructions Per Second,
also called MIPS
CS61CL L10 CPU II: Control & Pipeline (68)
Huddleston, Summer 2009 © UCB
Pipeline Hazard: Matching socks in later loa
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
• A depends on D; stall since folder tied up; Note this is much
different from processor cases so far. We have not had a
earlier instruction depend on a later one.
CS61CL L10 CPU II: Control & Pipeline (69)
Huddleston, Summer 2009 © UCB
Out-of-Order Laundry: Don’t Wait
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
bubble
B
C
O
D
r
d E
e
r F
• A depends on D; rest continue; need more
CS61CL L10 CPU II: Control & Pipeline (70)
Huddleston, Summer 2009 © UCB
Superscalar Laundry: Parallel per stag
6 PM 7
T
a
s
k
8
9
3030 30 30 30
A
B
C
O
D
r
d E
e
r F
10
11
12
1
2 AM
Time
(light clothing)
(dark clothing)
(very dirty clothing)
(light clothing)
(dark clothing)
(very dirty clothing)
• More resources, HW to match mix of parallel
CS61CL L10 CPU II: Control & Pipeline (71)
Huddleston, Summer 2009 © UCB
Superscalar Laundry: Mismatch Mix
6 PM 7
T
a
s
k
8
9
3030 30 30 30 30 30
A
10
11
12
1
2 AM
Time
(light clothing)
O
r B
d C
e
r
D
CS61CL L10 CPU II: Control & Pipeline (72)
(light clothing)
(dark clothing)
(light clothing)
Huddleston, Summer 2009 © UCB
Download