MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science

advertisement
MIPS Pipeline
Hakim Weatherspoon
CS 3410, Spring 2013
Computer Science
Cornell University
See P&H Chapter 4.6
A Processor
Review: Single cycle processor
memory
+4
inst
register
file
+4
=?
PC
control
offset
new
pc
alu
target
imm
extend
cmp
addr
din
dout
memory
Review: Single Cycle Processor
Advantages
• Single Cycle per instruction make logic and clock simple
Disadvantages
• Since instructions take different time to finish, memory
and functional unit are not efficiently utilized.
• Cycle time is the longest delay.
– Load instruction
• Best possible CPI is 1
– However, lower MIPS and longer clock period (lower clock
frequency); hence, lower performance.
Review: Multi Cycle Processor
Advantages
• Better MIPS and smaller clock period (higher clock
frequency)
• Hence, better performance than Single Cycle processor
Disadvantages
• Higher CPI than single cycle processor
Pipelining: Want better Performance
• want small CPI (close to 1) with high MIPS and short
clock period (high clock frequency)
• CPU time = instruction count x CPI x clock cycle time
Single Cycle vs Pipelined Processor
See: P&H Chapter 4.5
The Kids
Alice
Bob
They don’t always get along…
The Bicycle
The Materials
Drill
Saw
Glue
Paint
The Instructions
N pieces, each built following same sequence:
Saw
Drill
Glue
Paint
Design 1: Sequential Schedule
Alice owns the room
Bob can enter when Alice is finished
Repeat for remaining tasks
No possibility for conflicts
Sequential Performance
time
1
2
3
4
Latency:
Elapsed Time for Alice: 4
Throughput:
Elapsed Time for Bob: 4
Concurrency:
Total elapsed time: 4*N
Can we do better?
5
6
CPI =
7
8…
Design 2: Pipelined Design
Partition room into stages of a pipeline
Dave
Carol
Bob
Alice
One person owns a stage at a time
4 stages
4 people working simultaneously
Everyone moves right in lockstep
time Pipelined Performance
1
2
3
4
5
6
7…
Latency:
Throughput:
Concurrency:
Lessons
Principle:
Throughput increased by parallel execution
Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards (Thursday)
A Processor
Review: Single cycle processor
memory
+4
inst
register
file
+4
=?
PC
control
offset
new
pc
alu
target
imm
extend
cmp
addr
din
dout
memory
A Processor
memory
inst
register
file
alu
+4
addr
PC
din
control
new
pc
Instruction
Fetch
imm
extend
Instruction
Decode
dout
memory
compute
jump/branch
targets
Execute
Memory
WriteBack
Basic Pipeline
Five stage “RISC” load-store architecture
1. Instruction fetch (IF)
– get instruction from memory, increment PC
2. Instruction Decode (ID)
– translate opcode into control signals and read registers
3. Execute (EX)
– perform ALU operation, compute jump/branch targets
4. Memory (MEM)
– access memory if needed
5. Writeback (WB)
– update register file
Clock cycle
add
lw
Time Graphs
1
2
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
Latency:
Throughput:
Concurrency:
3
4
5
6
7
8
9
EX MEM WB
Principles of Pipelined Implementation
Break instructions across multiple clock cycles
(five, in this case)
Design a separate stage for the execution
performed during each clock cycle
Add pipeline registers (flip-flops) to isolate signals
between different stages
Pipelined Processor
See: P&H Chapter 4.6
register
file
B
alu
D
memory
D
A
Pipelined Processor
+4
IF/ID
M
B
ID/EX
Execute
EX/MEM
Memory
ctrl
Instruction
Decode
Instruction
Fetch
dout
compute
jump/branch
targets
ctrl
extend
din
memory
imm
new
pc
control
ctrl
inst
PC
addr
WriteBack
MEM/WB
IF
Stage 1: Instruction Fetch
Fetch a new instruction every cycle
• Current PC is index to instruction memory
• Increment the PC at end of cycle (assume no branches for
now)
Write values of interest to pipeline register (IF/ID)
• Instruction bits (for later decoding)
• PC+4 (for later computing branch targets)
IF
instruction
memory
addr
mc
+4
PC
new
pc
IF
instruction
memory
addr
mc
+4
PC
new
pc
IF
instruction
memory
mc
00 = read word
1
PC+4
+4
inst
addr
WE
PC
pcreg
new
pc
pcsel
pcrel
pcabs
IF/ID
Rest of pipeline
1
ID
Stage 2: Instruction Decode
On every cycle:
• Read IF/ID pipeline register to get instruction bits
• Decode instruction, generate control signals
• Read from register file
Write values of interest to pipeline register (ID/EX)
• Control information, Rd index, immediates, offsets, …
• Contents of Ra, Rb
• PC+4 (for computing branch targets later)
ctrl PC+4 imm
inst
PC+4
Stage 1: Instruction Fetch
WE
Rd register
D
file
A
A
IF/ID
ID/EX
decode
extend
Rest of pipeline
B
Ra Rb
B
ID
result
dest
EX
Stage 3: Execute
On every cycle:
•
•
•
•
Read ID/EX pipeline register to get values and control bits
Perform ALU operation
Compute targets (PC+4+offset, etc.) in case this is a branch
Decide if jump/branch should be taken
Write values of interest to pipeline register (EX/MEM)
• Control information, Rd index, …
• Result of ALU operation
• Value in case this is a memory store instruction
ctrl
ctrl
PC+4
+
||
j
pcrel
D
A
pcsel
Rest of pipeline
B
B
alu
target
imm
Stage 2: Instruction Decode
pcreg
EX
branch?
pcabs
ID/EX
EX/MEM
MEM
Stage 4: Memory
On every cycle:
• Read EX/MEM pipeline register to get values and control bits
• Perform memory load/store if needed
– address is ALU result
Write values of interest to pipeline register (MEM/WB)
• Control information, Rd index, …
• Result of memory operation
• Pass result of ALU operation
pcsel
MEM
branch?
memory
Rest of pipeline
D
pcrel
dout
mc
pcabs
ctrl
target
B
din
M
addr
ctrl
Stage 3: Execute
D
pcreg
EX/MEM
MEM/WB
WB
Stage 5: Write-back
On every cycle:
• Read MEM/WB pipeline register to get values and control bits
• Select value and write to register file
ctrl
M
Stage 4: Memory
D
result
dest
MEM/WB
WB
D
M
addr
din dout
EX/MEM
Rd
OP
Rd
mem
OP
ID/EX
B
D
A
B
Rt Rd PC+4
IF/ID
OP
PC+4
+4
PC
B
Ra Rb
imm
inst
inst
mem
A
Rd
D
MEM/WB
Administrivia
Required: partner for group project
Project1 (PA1) and Homework2 (HW2) are both out
PA1 Design Doc and HW2 due in one week, start early
Work alone on HW2, but in group for PA1
Save your work!
• Save often. Verify file is non-zero. Periodically save to Dropbox,
email.
• Beware of MacOSX 10.5 (leopard) and 10.6 (snow-leopard)
Use your resources
• Lab Section, Piazza.com, Office Hours, Homework Help Session,
• Class notes, book, Sections, CSUGLab
Administrivia
Check online syllabus/schedule
• http://www.cs.cornell.edu/Courses/CS3410/2013sp/schedule.html
Slides and Reading for lectures
Office Hours
Homework and Programming Assignments
Prelims (in evenings):
• Tuesday, February 26th
• Thursday, March 28th
• Thursday, April 25th
Schedule is subject to change
Collaboration, Late, Re-grading Policies
“Black Board” Collaboration Policy
• Can discuss approach together on a “black board”
• Leave and write up solution independently
• Do not copy solutions
Late Policy
• Each person has a total of four “slip days”
• Max of two slip days for any individual assignment
• Slip days deducted first for any late assignment,
cannot selectively apply slip days
• For projects, slip days are deducted from all partners
• 20% deducted per day late after slip days are exhausted
Regrade policy
• Submit written request to lead TA,
and lead TA will pick a different grader
• Submit another written request,
lead TA will regrade directly
• Submit yet another written request for professor to regrade.
Example: Sample Code (Simple)
Assume eight-register machine
Run the following code on a pipelined datapath
add
nand
lw
add
sw
Slides thanks to Sally McKee
r3 r1 r2 ; reg 3 = reg 1 + reg 2
r6 r4 r5 ; reg 6 = ~(reg 4 & reg 5)
r4 20 (r2) ; reg 4 = Mem[reg2+20]
r5 r2 r5 ; reg 5 = reg 2 + reg 5
r7 12(r3) ; Mem[reg3+12] = reg 7
Example: : Sample Code (Simple)
add
nand
lw
add
sw
r3,
r6,
r4,
r5,
r7,
r1, r2;
r4, r5;
20(r2);
r2, r5;
12(r3);
MIPS instruction formats
All MIPS instructions are 32 bits long, has 3 formats
R-type
op
6 bits
I-type
op
6 bits
J-type
rs
rt
5 bits 5 bits
rs
rt
rd shamt func
5 bits
5 bits
6 bits
immediate
5 bits 5 bits
16 bits
op
immediate (target address)
6 bits
26 bits
MIPS Instruction Types
Arithmetic/Logical
• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension
Memory Access
• load/store between registers and memory
• word, half-word and byte operations
Control flow
• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute
Clock cycle
add
nand
Time Graphs
1
2
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
lw
add
sw
Latency:
Throughput:
Concurrency:
3
4
5
6
7
8
9
EX MEM WB
M
U
X
4
target
+
PC+4
PC+4
R0
R1
regB
R2
R3
Register file
instruction
PC
Inst
mem
regA
0
Bits 16-20
Bits 26-31
IF/ID
valA
R4
R5
R6
valB
R7
extend
Bits 11-15
ALU
result
M
U
X
A
L
U
ALU
result
mdata
Data
mem
M
U
X
data
imm
dest
valB
Rd
Rt
op
ID/EX
M
U
X
dest
dest
op
op
EX/MEM
MEM/WB
data
dest
extend
0
0
IF/ID
ID/EX
M
U
X
EX/MEM
MEM/WB
add 3 1 2
M
U
X
4
0
+
4
0
R0
R1
R3
Register file
add 3 1 2
PC
Inst
mem
R2
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
add 3 1 2
Bits 11-15
Bits 16-20
Bits 26-31
Time: 1
IF/ID
0
0
0
0
M
U
X
A
L
U
0
0
Data
mem
M
U
X
data
0
dest
0
0
0
nop
ID/EX
M
U
X
0
0
nop
nop
EX/MEM
MEM/WB
nand 6 4 5
add 3 1 2
M
U
X
4
0
+
8
4
R0
R2
2
R3
Register file
nand 6 4 5
PC
Inst
mem
R1
1
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
nand 6 4 5
Bits 11-15
Bits 16-20
Bits 26-31
Time: 2
IF/ID
0
0
36
9
3
M
U
X
A
L
U
0
0
Data
mem
M
U
X
data
dest
0
3
2
add
ID/EX
M
U
X
0
0
nop
nop
EX/MEM
MEM/WB
lw 4 20(2)
nand 6 4 5
add 3 1 2
M
U
X
4
4
+
12
8
R0
R2
5
R3
Register file
lw 4 20(2)
PC
Inst
mem
R1
4
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
lw 4 20(2)
Bits 11-15
Bits 16-20
Bits 26-31
Time: 3/ 4
IF/ID
0
0
36
18
9
7
6
M
U
X
A
L
U
45
0
Data
mem
M
U
X
data
dest
9
6
5
3
2
nand
ID/EX
M
U
X
3
3
0
add
nop
EX/MEM
MEM/WB
add 5 2 5
lw 4 20(2)
nand 6 4 5
add 3 1 2
M
U
X
4
8
+
16
12
R0
R2
4
R3
Register file
add 5 2 5
PC
Inst
mem
R1
2
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
add 5 2 5
Bits 11-15
Bits 16-20
Bits 26-31
Time: 4
IF/ID
0
45
18
9
7
18
20
M
U
X
A
L
U
-3 45
0
Data
mem
M
U
X
data
dest
7
0
4
6
5
lw
ID/EX
M
U
X
6
6
3
nand
EX/MEM
3
add
MEM/WB
sw 7 12(3)
add 5 2 5
lw 4 20 (2)
nand 6 4 5
add 3 1 2
M
U
X
4
12
+
20
16
R0
R2
5
R3
Register file
sw 7 12(3)
PC
Inst
mem
R1
2
R4
R5
R6
R7
0
36
9
45
18
7
41
22
extend
Fetch:
sw 7 12(3)
Bits 11-15
Bits 16-20
Bits 26-31
Time: 5
IF/ID
0
-3
9
9
7
M
U
20 X
5
A
L
U
29 -3
45
0
Data
mem
M
U
X
data
dest
18
5
5
0
4
add
ID/EX
M
U
X
4
4
6
lw
EX/MEM
6
3
nand
MEM/WB
sw 7 12(3)
add 5 2 5
lw 4 20(2)
nand 6 4 5
M
U
X
4
16
+
20
R0
R1
3
R2
7
R3
Register file
PC
Inst
mem
R4
R5
R6
R7
0
36
9
45
18
7
-3
22
extend
No more
instructions
Bits 11-15
Bits 16-20
Bits 26-31
Time: 6
IF/ID
0
29
9
45
7
22
12
M
U
X
A
L
U
16 29
-3
99
Data
mem
M
U
X
data
dest
7
0
7
5
5
sw
ID/EX
M
U
X
5
5
4
add
EX/MEM
4
6
lw
MEM/WB
nop
nop
sw 7 12(3)
add 5 2 5
lw 4 20(2)
M
U
X
4
20
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
7
-3
22
0
16
45
M
U
12 X
A
L
U
57 16
Data
mem
Bits 11-15
Bits 16-20
22
0
7
Bits 26-31
Time: 7
IF/ID
data
dest
extend
No more
instructions
0
M
U
99 X
M
U
X
7
7
5
sw
ID/EX
EX/MEM
5
4
add
MEM/WB
nop
nop
nop
sw 7 12(3)
add 5 2 5
M
U
X
4
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
16
-3
22
57
M
U
X
57
Bits 11-15
M
U
X
Bits 16-20
IF/ID
Slides thanks to Sally McKee
0
Data
mem
M
U
X
data
dest
7
Bits 26-31
Time: 8
22
22
extend
No more
instructions
A
L
U
16
5
sw
ID/EX
EX/MEM
MEM/WB
nop
nop
nop
nop
sw 7 12(3)
M
U
X
4
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
16
-3
22
M
U
X
A
L
U
Data
mem
data
dest
extend
No more
instructions
M
U
X
Bits 11-15
M
U
X
Bits 16-20
Bits 21-23
Time: 9
IF/ID
ID/EX
EX/MEM
MEM/WB
Pipelining Recap
Powerful technique for masking latencies
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel
– Instruction level parallelism
Abstraction promotes decoupling
• Interface (ISA) vs. implementation (Pipeline)
Download