lecture 11

advertisement
A Pipelined Processor
Taken from Digital Design and Computer
Architecture by Harris and Harris
7-<1>
Introduction
• Microarchitecture: how to
implement an architecture in
hardware
• Processor:
– Datapath: functional blocks
– Control: control signals
Application
Software
programs
Operating
Systems
device drivers
Architecture
instructions
registers
Microarchitecture
datapaths
controllers
Logic
adders
memories
Digital
Circuits
AND gates
NOT gates
Analog
Circuits
amplifiers
filters
Devices
transistors
diodes
Physics
electrons
7-<2>
Microarchitecture
• Multiple implementations for a single architecture:
– Single-cycle
• Each instruction executes in a single cycle
– Multicycle
• Each instruction is broken up into a series of shorter steps
– Pipelined
• Each instruction is broken up into a series of steps
• Multiple instructions execute at once.
– Microcode
7-<3>
Architectural State
• Determines everything about a processor:
– PC
– 32 registers
– Memory
7-<4>
Instruction Formats
add
sub
or
and
…
lw
sw
…
beq
bne
…
R-Type
op
6 bits
rs
5 bits
rt
rd
shamt
funct
5 bits
5 bits
5 bits
6 bits
I-Type
op
6 bits
rs
5 bits
rt
imm
5 bits
16 bits
J-Type
op
addr
6 bits
26 bits
6-<5>
State Elements: PC, 32 registers, and
Memory
CLK
CLK
PC'
PC
32
32
32
A
RD
Instruction
Memory
5
32
5
5
32
A1
A2
A3
WD3
CLK
WE3
RD1
RD2
Register
File
WE
32
32
32
32
A
RD
Data
Memory
WD
7-<6>
32
Single-Cycle Datapath: lw fetch
• executing lw:
lw (index reg) (destination reg) (immediate offset)
example: lw $s3, 1($0) # read memory word 1 into $s3
I-Type
rt <- DataMemory[rs+imm]
op
6 bits
rs
5 bits
rt
imm
5 bits
16 bits
• STEP 1: Fetch instruction
CLK
CLK
PC'
PC
Instr
A
RD
Instruction
Memory
A1
CLK
WE3
WE
RD1
A
A2
A3
WD3
RD2
Register
File
RD
Data
Memory
WD
7-<7>
Single-Cycle Datapath: lw register read
• STEP 2: Read source operands from register file
CLK
CLK
25:21
PC'
PC
A
RD
Instruction
Memory
Instr
A1
CLK
WE3
WE
RD1
A
A2
A3
WD3
RD2
Register
File
RD
Data
Memory
WD
7-<8>
Single-Cycle Datapath: lw immediate
• STEP 3: Sign-extend the immediate
CLK
CLK
PC'
PC
A
RD
Instr
25:21
A1
CLK
WE3
WE
RD1
A
Instruction
Memory
A2
A3
WD3
RD
Data
Memory
WD
RD2
Register
File
SignImm
15:0
Sign Extend
7-<9>
Single-Cycle Datapath: lw address
• STEP 4: Compute the memory address
ALUControl2:0
PC
A
RD
Instr
25:21
Instruction
Memory
A1
A2
A3
WD3
WE3
RD2
SrcB
Register
File
CLK
Zero
SrcA
RD1
ALU
PC'
010
CLK
CLK
ALUResult
WE
A
RD
Data
Memory
WD
SignImm
15:0
Sign Extend
7-<10>
Single-Cycle Datapath: lw memory read
• STEP 5: Read data from memory and write it back to
register file
RegWrite
ALUControl2:0
1
PC
A
RD
Instruction
Memory
Instr
25:21
20:16
A1
A2
A3
WD3
CLK
WE3
Zero
SrcA
RD1
RD2
SrcB
Register
File
ALU
CLK
PC'
010
CLK
ALUResult
WE
A
RD
Data
Memory
WD
ReadData
SignImm
15:0
Sign Extend
7-<11>
Single-Cycle Datapath: lw PC increment
• STEP 6: Determine the address of the next instruction
ALUControl2:0
RegWrite
010
1
PC
RD
A
Instr
Instruction
Memory
25:21
A1
A2
20:16
A3
+
WD3
WE3
RD2
SrcB
Register
File
WE
Zero
SrcA
RD1
ALU
CLK
PC'
CLK
CLK
ALUResult
RD
Data
Memory
WD
A
ReadData
PCPlus4
SignImm
4
15:0
Sign Extend
Result
7-<12>
Single-Cycle Datapath: sw
• sw: sw (index reg) (source reg) (immediate offset)
I-Type
rs
op
6 bits
5 bits
rt
imm
5 bits
16 bits
• Write data in rt to memory
RegWrite
ALUControl2:0
0
1
A
RD
Instr
Instruction
Memory
25:21
A1
CLK
WE3
20:16
A2
20:16
A3
WD3
Zero
SrcA
RD1
RD2
SrcB
ALU
PC
+
PC'
010
CLK
CLK
MemWrite
ALUResult
WriteData
Register
File
WE
A
RD
Data
Memory
WD
ReadData
PCPlus4
SignImm
4
15:0
Sign Extend
Result
7-<13>
Single-Cycle Datapath: R-type instructions
example: rd <- rt + rs ; add $s0, $s1, $s2
R-Type
Read from rs and rt
op
rs
rt
rd
Write ALUResult to register file
6 bits
5 bits
5 bits
5 bits
Write to rd (instead of rt)
RegWrite
RegDst
1
1
0
PC
A
RD
Instr
Instruction
Memory
25:21
20:16
A1
A2
A3
WD3
CLK
WE3
Zero
SrcA
RD1
0 SrcB
RD2
1
Register
File
20:16
ALUResult
WriteData
shamt
funct
5 bits
6 bits
MemtoReg
0
0
WE
A
RD
Data
Memory
WD
ReadData
0
1
0
15:11
WriteReg4:0
PCPlus4
MemWrite
varies
ALU
CLK
PC'
ALUSrc ALUControl2:0
CLK
+
•
•
•
•
1
SignImm
4
15:0
Sign Extend
Result
7-<14>
Single-Cycle Datapath: beq
• Determine whether values in rs and rt are equal
beq $s0, $s1, target
…
target:
# label
I-Type
rs
op
6 bits
5 bits
rt
imm
5 bits
16 bits
• Calculate branch target address:
BTA = (sign-extended immediate << 2) + (PC+4)
PCSrc
RegWrite
RegDst
0
110
1
A
RD
Instr
Instruction
Memory
25:21
20:16
A1
A2
A3
WD3
CLK
WE3
RD2
0 SrcB
1
Register
File
20:16
+
WriteReg4:0
15:0
WriteData
x
0
WE
A
RD
Data
Memory
WD
ReadData
0
1
1
SignImm
4
ALUResult
MemtoReg
0
15:11
PCPlus4
Zero
SrcA
RD1
ALU
PC
MemWrite
1
Sign Extend
<<2
+
PC'
0
CLK
CLK
0
x
ALUSrc ALUControl2:0 Branch
PCBranch
7-<15>
Result
Complete Single-Cycle Processor
31:26
5:0
MemtoReg
Control
MemWrite
Unit
Branch
ALUControl2:0
Op
ALUSrc
Funct
RegDst
PCSrc
RegWrite
CLK
PC'
PC
1
A
RD
Instr
Instruction
Memory
25:21
20:16
A1
A2
A3
WD3
WE3
RD2
0 SrcB
1
Register
File
20:16
+
WriteReg4:0
15:0
WriteData
A
RD
Data
Memory
WD
ReadData
0
1
1
SignImm
4
ALUResult
WE
0
15:11
PCPlus4
Zero
SrcA
RD1
ALU
0
CLK
Sign Extend
<<2
+
CLK
PCBranch
Result
7-<16>
Review: Processor Performance
Program Execution Time
= (# instructions)(cycles/instruction)(seconds/cycle)
= # instructions x CPI x TC
7-<17>
Single-Cycle Performance
• TC is limited by the critical path (lw)
31:26
5:0
MemtoReg
Control
MemWrite
Unit
Branch
ALUControl 2:0
0
0
PCSrc
Op
ALUSrc
Funct RegDst
RegWrite
PC'
PC
1
A
RD
Instr
Instruction
Memory
25:21
A1
WE3
010
SrcA
RD1
1
20:16
A2
A3
WD3
RD2
0 SrcB
1
Register
File
+
WriteReg4:0
A
RD
Data
Memory
WD
ReadData
0
1
1
SignImm
15:0
WriteData
1
WE
0
15:11
4
ALUResult
0
0
20:16
PCPlus4
Zero
Sign Extend
<<2
+
0
CLK
1
ALU
CLK
CLK
PCBranch
Result
7-<18>
Single-Cycle Performance
• Single-cycle critical path:
Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux +
tRFsetup
• In most implementations, limiting paths are:
– memory, ALU, register file.
– Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
7-<19>
Single-Cycle Performance Example
Element
Parameter
Delay (ps)
Register clock-to-Q
tpcq_PC
30
Register setup
tsetup
20
Multiplexer
tmux
25
ALU
tALU
200
Memory read
tmem
250
Register file read
tRFread
150
Register file setup
tRFsetup
20
Tc =
7-<20>
Single-Cycle Performance Example
Element
Parameter
Delay (ps)
Register clock-to-Q
tpcq_PC
30
Register setup
tsetup
20
Multiplexer
tmux
25
ALU
tALU
200
Memory read
tmem
250
Register file read
tRFread
150
Register file setup
tRFsetup
20
Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
= [30 + 2(250) + 150 + 25 + 200 + 20] ps
= 925 ps
7-<21>
Single-Cycle Performance Example
• For a program with 100 billion instructions executing on a singlecycle MIPS processor,
Execution Time =
7-<22>
Single-Cycle Performance Example
• For a program with 100 billion instructions executing on a singlecycle MIPS processor,
Execution Time = # instructions x CPI x TC
= (100 × 109)(1)(925 × 10-12 s)
= 92.5 seconds
7-<23>
Pipelined Processor
• Temporal parallelism
• Divide single-cycle processor into 5 ROUGHLY
EQUIVALENT stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Each stage includes one “slow step”
• Add pipeline registers between stages
• 5 stages => ~ 5 times faster!
• All modern high-performance processors are pipelined.
7-<24>
Single-Cycle vs. Pipelined Performance
Single-Cycle
0
100
200
300
400
500
600
700
800
900
1000 1100 1200 1300 1400 1500 1600 1700 1800 1900
Instr
1
Fetch
Instruction
Decode
Read Reg
Execute
ALU
Memory
Read / Write
Time (ps)
Write
Reg
Fetch
Instruction
2
Decode
Read Reg
Execute
ALU
Memory
Read / Write
Write
Reg
Pipelined
Instr
1
2
3
Fetch
Instruction
Decode
Read Reg
Fetch
Instruction
Execute
ALU
Decode
Read Reg
Fetch
Instruction
Memory
Read/Write
Execute
ALU
Decode
Read Reg
Write
Reg
Memory
Read/Write
Execute
ALU
Write
Reg
Memory
Read/Write
Write
Reg
The length of all pipeline stages is set by the slowest stage
The instruction latency is 5 * 250 ps = 1250 ps
7-<25>
Pipelining Abstraction
1
2
3
4
5
6
7
8
9
10
Time (cycles)
lw
$s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw
$s6, 20($s1)
or
$s7, $t3, $t4
IM
lw
$0
RF 40
IM
add
DM
+
$t1
RF $t2
IM
sub
RF
DM
+
$s1
RF $s5
IM
$s2
and
RF
DM
-
$t5
RF $t6
IM
$s3
sw
RF
DM
&
$s1
RF 20
IM
$s4
or
+
$t3
RF $t4
$s5
RF
DM
$s6
RF
DM
|
7-<26>
$s7
RF
Single-Cycle and Pipelined Datapath
CLK
0
PC'
PC
1
A
Instr
RD
Instruction
Memory
25:21
20:16
A1
WE3
A2
A3
WD3
+
CLK
Zero
SrcA
RD1
RD2
0 SrcB
1
Register
File
20:16
0
15:11
1
ALU
CLK
WE
ALUResult
WriteData
ReadData
A
RD
Data
Memory
WD
0
1
WriteReg4:0
PCPlus4
SignImm
4
Sign Extend
<<2
+
15:0
PCBranch
Result
CLK
CLK
CLK
0
PC'
PCF
1
A
RD
Instruction
Memory
CLK
InstrD
25:21
20:16
A1
WE3
A2
A3
WD3
ALUOutW
CLK
CLK
RD2
0 SrcBE
1
Register
File
RdE
0
A
RD
Data
Memory
WD
ReadDataW
0
1
WriteRegE4:0
1
+
15:11
ALUOutM
WriteDataM
WriteDataE
RtE
20:16
WE
ZeroM
SrcAE
RD1
ALU
CLK
SignImmE
15:0
<<2
Sign Extend
PCBranchM
+
4
PCPlus4F
PCPlus4D
PCPlus4E
ResultW
Fetch
Decode
Execute
Memory
Writeback
7-<27>
Corrected Pipelined Datapath
• WriteReg address must arrive at the same time as Result
CLK
CLK
CLK
PC'
PCF
A
RD
Instruction
Memory
InstrD
25:21
20:16
A1
WE3
PCPlus4F
0 SrcBE
1
RtE
RdE
0
ALUOutM
WriteDataE
WriteDataM
WriteRegE4:0
WriteRegM4:0
A
RD
Data
Memory
WD
ReadDataW
0
1
WriteRegW 4:0
1
SignImmE
<<2
+
Sign Extend
PCPlus4D
WE
ZeroM
SrcAE
A2
RD2
A3
Register
WD3
File
15:11
4
CLK
RD1
20:16
15:0
ALUOutW
CLK
+
0
1
CLK
ALU
CLK
PCBranchM
PCPlus4E
ResultW
Fetch
Decode
Execute
Memory
Writeback
7-<28>
Pipelined Control
CLK
5:0
RegWriteM
RegWriteW
MemtoRegE
MemtoRegM
MemtoRegW
MemWriteD
MemWriteE
MemWriteM
BranchD
BranchE
BranchM
Op
ALUControlD
ALUControlE2:0
Funct
ALUSrcD
ALUSrcE
RegDstD
RegDstE
CLK
PC'
PCF
1
A
InstrD
RD
Instruction
Memory
ALUOutW
CLK
25:21
20:16
A1
A2
A3
WD3
CLK
WE3
RD2
0 SrcBE
1
Register
File
RtE
20:16
RdE
0
15:0
4
PCPlus4F
Sign Extend
PCPlus4D
ALUOutM
WriteDataE
WriteDataM
WriteRegE4:0
WriteRegM4:0
A
RD
Data
Memory
WD
ReadDataW
0
1
WriteRegW 4:0
1
+
15:11
WE
ZeroM
SrcAE
RD1
SignImmE
<<2
+
0
PCSrcM
ALU
CLK
CLK
RegWriteE
RegWriteD
Control
MemtoRegD
Unit
31:26
CLK
PCBranchM
PCPlus4E
ResultW
Same control unit as single-cycle processor
Control delayed to proper pipeline stage
7-<29>
Pipeline Hazard
• Occurs when an instruction depends on results
from previous instruction that hasn’t completed.
• Types of hazards:
– Data hazard: register value not written back to register
file yet
– Control hazard: next instruction not decided yet
(caused by branches)
7-<30>
Data Hazard
1
2
3
4
5
6
7
8
Time (cycles)
add $s0, $s2, $s3
and $t0, $s0, $s1
or
$t1, $s4, $s0
sub $t2, $s0, $s5
IM
add
$s2
RF $s3
IM
and
DM
+
$s0
RF
$s0
RF $s1
IM
or
DM
&
$t0
RF
$s4
RF $s0
IM
sub
DM
|
$t1
RF
$s0
RF $s5
-
DM
7-<31>
$t2
RF
Handling Data Hazards
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
7-<32>
Control Hazards
• beq:
– branch is not determined until the fourth stage of the pipeline
– Instructions after the branch are fetched before branch occurs
– These instructions must be flushed if the branch happens
• Branch misprediction penalty
– number of instruction flushed when branch is taken
– May be reduced by determining branch earlier
7-<33>
Branch Prediction
• Guess whether branch will be taken
– Backward branches are usually taken (loops)
– Perhaps consider history of whether branch was
previously taken to improve the guess
• Good prediction reduces the fraction of branches
requiring a flush
7-<34>
Download