55:035
Computer Architecture and Organization
Lecture 10
Outline

• Pipelining
    • Basics
    • Throughput
    • Execution Time
    • Pipeline Registers
    • Program Execution
    • Hazards
        • Structural
        • Data
        • Control
ILP: Instruction Level Parallelism
• Single-cycle and multi-cycle datapaths execute one instruction at a time.
• How can we get better performance?
• Answer: Execute multiple instructions at a time:
    • Pipelining – Enhance a multi-cycle datapath to fetch one instruction every cycle.
    • Parallelism – Fetch multiple instructions every cycle.
Automobile Team Assembly
Four tasks, 1 hour each, all completed on one car before the next car is started.
• 1 car assembled every four hours
• 6 cars per day
• 180 cars per month
• 2,160 cars per year
Automobile Assembly Line
• Task 1: Mechanical, 1 hour
• Task 2: Electrical, 1 hour
• Task 3: Painting, 1 hour
• Task 4: Testing, 1 hour

First car assembled in 4 hours (pipeline latency), thereafter 1 car per hour.
• 21 cars on first day, thereafter 24 cars per day
• 717 cars per month
• 8,637 cars per year
Throughput: Team Assembly
[Timeline: Red car started, Mechanical, Electrical, Painting, Testing, Red car completed; then Blue car started, Mechanical, Electrical, Painting, Testing, Blue car completed.]

Time of assembling one car: $\text{Time} = n$ hours,
where n is the number of nearly equal subtasks, each requiring 1 unit of time.

$\text{Throughput} = 1/n$ cars per unit time
Throughput: Assembly Line
[Timeline: Car 1, Car 2, Car 3 and Car 4 each pass through Mechanical, Electrical, Painting and Testing, each car starting one time unit after the previous one. Car 1 is complete after n time units, Car 2 one time unit later.]

Time to complete first car: $n$ time units (latency)
Cars completed in time T: $T - n + 1$
Throughput: $\dfrac{T - n + 1}{T} = 1 - \dfrac{n - 1}{T}$ cars per unit time

$$\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n - 1)/T}{1/n} = n - \frac{n(n - 1)}{T} \rightarrow n \quad \text{as } T \rightarrow \infty$$
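The two throughput formulas can be checked against the car counts quoted for the automobile example. The sketch below is only an illustration: the function names are made up here, and it assumes a 24-hour day, a 30-day month and a 360-day year, which are the values that reproduce the assembly-line figures of 717 and 8,637 cars.

```python
def team_cars(hours: int, n: int) -> int:
    """Cars finished in `hours` when each car is built start to finish (n hours per car)."""
    return hours // n


def line_cars(hours: int, n: int) -> int:
    """Cars finished in `hours` on an n-stage assembly line: T - n + 1 (never negative)."""
    return max(0, hours - n + 1)


N_TASKS = 4  # Mechanical, Electrical, Painting, Testing

# Assumed calendar: 24-hour day, 30-day month, 360-day year.
for label, hours in [("day", 24), ("month", 30 * 24), ("year", 360 * 24)]:
    team = team_cars(hours, N_TASKS)
    line = line_cars(hours, N_TASKS)
    print(f"first {label}: team = {team}, line = {line}, ratio = {line / team:.2f}")
```

The printed ratio moves toward n = 4 as the time window grows, which is the limit stated above.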
Some Features of Assembly Line
[Figure: assembly line with Task 1 Mechanical, Task 2 Electrical, Task 3 Painting, Task 4 Testing, 1 hour each.]
• Electrical parts delivered (JIT)
• Defect found: stall assembly line to fix the cause of defect
• 3 cars in the assembly line are suspects, to be removed (flush pipeline)
Pipelining in a Computer
• Divide datapath into nearly equal tasks, to be performed serially and requiring non-overlapping resources.
• Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task.
• Synchronize tasks with a clock having a cycle time that just exceeds the time required by the longest task.
• Break each instruction down into a fixed number of tasks so that instructions can be executed in a staggered fashion.
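The clock-period rule can be illustrated with a few lines of code. This is a minimal sketch: the stage latencies are the ones used for the lw instruction elsewhere in this lecture, the function names are hypothetical, and the small extra margin by which the clock "just exceeds" the longest task (register setup and clock skew) is ignored.

```python
# Stage latencies in ns, as used for lw in this lecture.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def single_cycle_clock_ns(stages: dict) -> int:
    """One long cycle must cover every task of the slowest instruction."""
    return sum(stages.values())


def pipelined_clock_ns(stages: dict) -> int:
    """The pipelined clock is set by the longest single task (margin ignored)."""
    return max(stages.values())


print("single-cycle clock:", single_cycle_clock_ns(STAGE_NS), "ns")
print("pipelined clock:   ", pipelined_clock_ns(STAGE_NS), "ns")
```

Because the tasks are only nearly equal, the 1 ns stages waste half of each 2 ns cycle; this is why the clock ratio is 4 rather than 5 for a five-stage pipeline.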
Single-Cycle Datapath
Stages: IF = Instr. fetch; ID = Instr. decode (also reg. file read); EX = Execution (ALU operation); MEM = Data access; WB = Write back (reg. file write).

Instruction class                    IF     ID     EX     MEM    WB     Total time
lw                                   2 ns   1 ns   2 ns   2 ns   1 ns   8 ns
sw                                   2 ns   1 ns   2 ns   2 ns   idle   8 ns
R-format (add, sub, and, or, slt)    2 ns   1 ns   2 ns   idle   1 ns   8 ns
B-format (beq)                       2 ns   1 ns   2 ns   idle   idle   8 ns

idle: No operation on data; idle time equalizes instruction length to a fixed clock period.
Execution Time: Single-Cycle
[Timing diagram, time axis in ns: each instruction performs IF, ID, EX, MEM, WB within one 8 ns clock cycle.]

Instruction        Time (ns)
lw $1, 100($0)     0 to 8
lw $2, 200($0)     8 to 16
lw $3, 300($0)     16 to 24

Clock cycle time = 8 ns
Total time for executing three lw instructions = 24 ns
Pipelined Datapath
Stages: IF = Instr. fetch; ID = Instr. decode (also reg. file read); EX = Execution (ALU operation); MEM = Data access; WB = Write back (reg. file write).

Instruction class                    IF     ID             EX     MEM    WB             Total time
lw                                   2 ns   1 ns → 2 ns    2 ns   2 ns   1 ns → 2 ns    10 ns
sw                                   2 ns   1 ns → 2 ns    2 ns   2 ns   1 ns → 2 ns    10 ns
R-format (add, sub, and, or, slt)    2 ns   1 ns → 2 ns    2 ns   2 ns   1 ns → 2 ns    10 ns
B-format (beq)                       2 ns   1 ns → 2 ns    2 ns   2 ns   1 ns → 2 ns    10 ns

Every stage occupies one 2 ns clock cycle; the 1 ns stages are padded to 2 ns.
No operation on data; idle time inserted to equalize instruction lengths.
Execution Time: Pipeline
[Timing diagram, time axis in ns: one stage per 2 ns clock cycle, a new instruction starting every cycle.]

Time (ns)          0-2   2-4   4-6   6-8   8-10   10-12   12-14
lw $1, 100($0)     IF    ID    EX    MEM   WB
lw $2, 200($0)           IF    ID    EX    MEM    WB
lw $3, 300($0)                 IF    ID    EX     MEM     WB

Clock cycle time = 2 ns, four times faster than single-cycle clock
Total time for executing three lw instructions = 14 ns

Performance ratio = Single-cycle time / Pipeline time = 24 / 14 = 1.7
Pipeline Performance
Clock cycle time = 2 ns

1,003 lw instructions:
Total time for executing 1,003 lw instructions = 2,014 ns
Performance ratio = Single-cycle time / Pipeline time = 8,024 / 2,014 = 3.98

10,003 lw instructions:
Performance ratio = 80,024 / 20,014 = 3.998 → Clock cycle ratio (4)

Pipeline performance approaches clock-cycle ratio for long programs.
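The execution times quoted for 3, 1,003 and 10,003 lw instructions follow from two simple formulas, sketched below. The function names are illustrative only; the sketch assumes an ideal five-stage pipeline with no stalls, an 8 ns single-cycle clock and a 2 ns pipelined clock.

```python
def single_cycle_time_ns(n_instr: int, clock_ns: int = 8) -> int:
    """Every instruction takes one full (long) clock cycle."""
    return n_instr * clock_ns


def pipeline_time_ns(n_instr: int, stages: int = 5, clock_ns: int = 2) -> int:
    """The first instruction needs `stages` cycles; each later one finishes one cycle after."""
    return (stages + n_instr - 1) * clock_ns


for n in (3, 1_003, 10_003):
    sc = single_cycle_time_ns(n)
    pl = pipeline_time_ns(n)
    print(f"{n:>6} instructions: single-cycle {sc} ns, pipeline {pl} ns, ratio {sc / pl:.3f}")
```

The ratio climbs from about 1.7 toward the clock-cycle ratio of 4 because the fixed fill time of four extra cycles is spread over more and more instructions.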
Single-Cycle Datapath
[Figure: single-cycle datapath (PC, Add 4, instruction memory, register file, sign extend, shift left 2, ALU, ALU control, data memory, three muxes, CONTROL) divided into five sections.]
• IF: Instr. fetch
• ID: Instr. decode, reg. file read
• EX: Execute, address calc.
• MEM: mem. access
• WB: writeback
Instruction word fields shown: 26-31 (opcode), 21-25, 16-20, 11-15, 0-15, 0-5.
Control signals shown: RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite.
Pipelining of RISC Instructions
General instruction steps: Fetch Instruction, Examine Opcode, Fetch Operands, Perform Operation, Store Result.

Stage   Function
IF      Instruction Fetch
ID      Decode instruction and Fetch operands
EX      Execute
MEM     Memory Operation
WB      Write Back to Reg file

Although an instruction takes five clock cycles, one instruction is completed every cycle.
Pipeline Registers
[Figure: the single-cycle datapath with four pipeline registers, IF/ID, ID/EX, EX/MEM and MEM/WB, inserted at the stage boundaries. Control signals shown: RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite.]
This requires a CONTROL not too different from single-cycle.
Pipeline Register Functions
Four pipeline registers are added:

Register name   Data held
IF/ID           PC+4, Instruction word (IW)
ID/EX           PC+4, R1, R2, IW(0-15) sign ext., IW(11-15)
EX/MEM          PC+4, zero, ALUResult, R2, IW(11-15) or IW(16-20)
MEM/WB          M[ALUResult], ALUResult, IW(11-15) or IW(16-20)
Pipelined Datapath
[Figure: pipelined datapath with PC, Add 4, instruction memory, register file, sign extend, shift left 2, ALU, data memory and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. The write-register number is carried through the pipeline registers: IW(11-15) for R-type, IW(16-20) for I-type lw.]
Five-Cycle Pipeline
Cycle   Hardware used          Pipeline register written
CC1     IM                     IF/ID
CC2     ID, REG. FILE READ     ID/EX
CC3     ALU                    EX/MEM
CC4     DM                     MEM/WB
CC5     REG. FILE WRITE
Add Instruction
add $t0, $s1, $s2

Machine instruction word:
000000 10001 10010 01000 00000 100000
opcode  $s1   $s2   $t0        function

Cycle   Stage   Hardware               Action
CC1     IF      IM
CC2     ID      ID, REG. FILE READ     read $s1, read $s2
CC3     EX      ALU                    add $s1+$s2
CC4     MEM     DM
CC5     WB      REG. FILE WRITE        write $t0
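The machine word for add $t0, $s1, $s2 can be rebuilt field by field. The sketch below is illustrative: the helper name `encode_r_type` is made up here, and it relies on the standard MIPS R-format layout (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6) and the standard register numbers $t0 = 8, $s1 = 17, $s2 = 18.

```python
def encode_r_type(rs: int, rt: int, rd: int, funct: int, shamt: int = 0, opcode: int = 0) -> int:
    """Pack the six R-format fields into a 32-bit word."""
    fields = (("opcode", opcode, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6))
    word = 0
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word = (word << bits) | value  # append this field to the right
    return word


REG = {"$t0": 8, "$s1": 17, "$s2": 18}  # standard MIPS register numbers

word = encode_r_type(rs=REG["$s1"], rt=REG["$s2"], rd=REG["$t0"], funct=0b100000)
bits = f"{word:032b}"
print(bits[0:6], bits[6:11], bits[11:16], bits[16:21], bits[21:26], bits[26:32])
print(hex(word))
```

The destination register sits in bits 11-15 for this format, which is why the datapath selects IW(11-15) as the write register for R-type instructions.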
Pipelined Datapath Executing add
[Figure: the pipelined datapath annotated for add. Register numbers s1, s2 and t0 come from the instruction word; the register file outputs $s1 and $s2 feed the ALU; t0 (IW 11-15 for R-type, 16-20 for I-type lw) is carried through the pipeline registers as the write-register number.]
Load Instruction
lw $t0, 1200($t1)

Machine instruction word (offset 1200 decimal = 0x04B0):
100011 01001 01000 0000 0100 1011 0000
opcode  $t1   $t0   1200

Cycle   Stage   Hardware               Action
CC1     IF      IM
CC2     ID      ID, REG. FILE READ     read $t1, sign ext 1200
CC3     EX      ALU                    add $t1+1200
CC4     MEM     DM                     read M[addr]
CC5     WB      REG. FILE WRITE        write $t0
Pipelined Datapath Executing lw
[Figure: the pipelined datapath annotated for lw. Register number t1 selects $t1 from the register file; IW(0-15) = 1200 is sign-extended; the ALU output is the address (addr) applied to the data memory; the data read is written back to t0, taken from IW(16-20) (11-15 for R-type, 16-20 for I-type lw).]
Store Instruction
sw $t0, 1200($t1)

Machine instruction word (offset 1200 decimal = 0x04B0):
101011 01001 01000 0000 0100 1011 0000
opcode  $t1   $t0   1200

Cycle   Stage   Hardware               Action
CC1     IF      IM
CC2     ID      ID, REG. FILE READ     read $t1, sign ext 1200
CC3     EX      ALU                    add $t1+1200
CC4     MEM     DM                     write M[addr] ← $t0
CC5     WB      REG. FILE WRITE
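Both lw and sw use the I-format, so one small encoder covers them. This is a sketch under stated assumptions: the helper name `encode_i_type` is hypothetical, the layout is the standard MIPS I-format (opcode 6 bits, rs 5, rt 5, immediate 16), and the register numbers $t0 = 8 and $t1 = 9 are the standard ones. The offset 1200 is read as a decimal number, which encodes as 0000 0100 1011 0000 (0x04B0); the pattern 0001 0010 0000 0000 would instead be hexadecimal 0x1200.

```python
LW_OPCODE = 0b100011
SW_OPCODE = 0b101011
T0, T1 = 8, 9  # standard MIPS register numbers for $t0 and $t1


def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Pack opcode, base register rs, data register rt and a 16-bit offset."""
    if not -(1 << 15) <= imm < (1 << 15):
        raise ValueError("offset does not fit in 16 signed bits")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


lw_word = encode_i_type(LW_OPCODE, rs=T1, rt=T0, imm=1200)  # lw $t0, 1200($t1)
sw_word = encode_i_type(SW_OPCODE, rs=T1, rt=T0, imm=1200)  # sw $t0, 1200($t1)

for name, w in (("lw", lw_word), ("sw", sw_word)):
    b = f"{w:032b}"
    print(name, b[0:6], b[6:11], b[11:16], b[16:32], hex(w))
```

The two words differ only in the opcode field; for lw the register in bits 16-20 is the destination, while for sw it names the register whose value is stored.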
Pipelined Datapath Executing sw
[Figure: the pipelined datapath annotated for sw. Register numbers t1 and t0 select $t1 and $t0 from the register file; IW(0-15) = 1200 is sign-extended; the ALU output is the address (addr) and $t0 is the data written to the data memory.]
Executing a Program
Consider a five-instruction segment:

    lw   $10, 20($1)
    sub  $11, $2, $3
    add  $12, $3, $4
    lw   $13, 24($1)
    add  $14, $5, $6
Program Execution
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns). Each instruction uses IM, ID/REG. FILE READ, ALU, DM and REG. FILE WRITE in successive cycles.]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
lw   $10, 20($1)        IF    ID    EX    MEM   WB
sub  $11, $2, $3              IF    ID    EX    MEM   WB
add  $12, $3, $4                    IF    ID    EX    MEM   WB
lw   $13, 24($1)                          IF    ID    EX    MEM   WB
add  $14, $5, $6                                IF    ID    EX    MEM   WB
CC5
[Figure: pipelined datapath showing the instruction occupying each stage during clock cycle 5.]

Stage   Instruction
IF      add $14, $5, $6
ID      lw $13, 24($1)
EX      add $12, $3, $4
MEM     sub $11, $2, $3
WB      lw $10, 20($1)
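The stage occupancy in any clock cycle can be computed directly from the position of each instruction in the program. The following is a minimal sketch assuming an ideal pipeline (no stalls or flushes); `stage_occupancy` is a name chosen for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
    "add $14, $5, $6",
]


def stage_occupancy(program: list, cycle: int) -> dict:
    """Map each stage to the instruction in it during 1-based clock cycle `cycle`."""
    occupancy = {}
    for i, instr in enumerate(program):
        stage_index = cycle - 1 - i  # instruction i enters IF in cycle i + 1
        if 0 <= stage_index < len(STAGES):
            occupancy[STAGES[stage_index]] = instr
    return occupancy


for cc in range(1, len(PROGRAM) + len(STAGES)):
    print(f"CC{cc}:", stage_occupancy(PROGRAM, cc))
```

For cycle 5 this reports all five stages busy, one instruction in each, which is the first cycle in which the pipeline is full.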
Advantages of Pipeline
• After the fifth cycle (CC5), one instruction is completed each cycle; CPI ≈ 1, neglecting the initial pipeline latency of 5 cycles.
    • Pipeline latency is defined as the number of stages in the pipeline, or
    • the number of clock cycles after which the first instruction is completed.
• The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath.
• For multicycle datapath, CPI = 3. ….
• So, pipelined execution is faster, but . . .
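The claim CPI ≈ 1 can be made precise with a short calculation. The symbols below are introduced only for this illustration: k is the number of pipeline stages (5 here) and N is the number of instructions executed, with no stalls assumed.

```latex
\begin{align*}
\text{cycles}(N) &= k + (N - 1) \\
\text{CPI}(N) &= \frac{k + N - 1}{N} = 1 + \frac{k - 1}{N} \;\longrightarrow\; 1 \quad \text{as } N \to \infty
\end{align*}
```

For k = 5 and N = 1,003 this gives CPI = 1,007 / 1,003 ≈ 1.004, so the latency term is already negligible for programs of modest length.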
Pipeline Hazards
• Definition: A hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction.
• Three types of hazards:
    • Structural hazard (resource conflict)
    • Data hazard
    • Control hazard
Structural Hazard
• Two instructions cannot execute in the same clock cycle due to a resource conflict.
• Example: Consider a computer with a common data and instruction memory. The fourth cycle of a lw instruction requires memory access (memory read), and at the same time the first cycle of the fourth instruction (three instructions after the lw) requires instruction fetch (memory read). This will cause a memory resource conflict.
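The conflict described in the example can be located mechanically. The sketch below is an illustration under simple assumptions: a five-stage pipeline with no stalls, a single memory shared by IF and MEM, and only lw and sw accessing data memory. The function name `memory_conflicts` is hypothetical.

```python
def memory_conflicts(ops: list) -> list:
    """Return (cycle, index of instr in MEM, index of instr in IF) for a shared memory.

    Instruction i (0-based) is fetched in cycle i + 1 and reaches MEM in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(ops):
        if op in ("lw", "sw"):
            mem_cycle = i + 4
            fetch_index = mem_cycle - 1  # the instruction being fetched in that cycle
            if fetch_index < len(ops):
                conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


# Opcodes of the five-instruction segment: lw, sub, add, lw, add
program_ops = ["lw", "sub", "add", "lw", "add"]
print(memory_conflicts(program_ops))
```

For this segment the only conflict is in cycle 4, between the first lw (data read) and the fourth instruction (instruction fetch); the second lw reaches MEM after the last instruction has already been fetched.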
Example of Structural Hazard
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns), with a common data and instr. mem. (IM/DM) used for both IF and MEM.]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
lw   $10, 20($1)        IF    ID    EX    MEM   WB
sub  $11, $2, $3              IF    ID    EX    MEM   WB
add  $12, $3, $4                    IF    ID    EX    MEM   WB
lw   $13, 24($1)                          IF    ID    EX    MEM   WB

CC4: IM/DM needed by two instructions (MEM of lw $10 and IF of lw $13).
Possible Remedies for Structural Hazards
• Provide duplicate hardware resources in datapath.
• Control unit or compiler can insert delays (no-op cycles) between instructions. This is known as pipeline stall or bubble.
Stall (Bubble) for Structural Hazard
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns), with a common IM/DM. A stall (bubble) is inserted so that the fetch of lw $13 no longer coincides with the memory access of lw $10.]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
lw   $10, 20($1)        IF    ID    EX    MEM   WB
sub  $11, $2, $3              IF    ID    EX    MEM   WB
add  $12, $3, $4                    IF    ID    EX    MEM   WB
Stall (bubble)                            -     -     -     -     -
lw   $13, 24($1)                                IF    ID    EX    MEM   WB
Data Hazard
• Data hazard means that an instruction cannot be completed because the needed data, being generated by another instruction in the pipeline, is not available.
• Example: consider two instructions:

    add $s0, $t0, $t1
    sub $t2, $s0, $t3      # needs $s0
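A read-after-write dependence like the one between add and sub can be found by comparing destination and source registers of nearby instructions. This is a sketch only: the `Instr` tuple and `raw_hazards` function are hypothetical, and the default `window=2` (how many earlier instructions are still in flight) is an assumption about the pipeline, not something stated in the lecture.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. sw, beq)
    srcs: Tuple[str, ...]    # registers read


def raw_hazards(program: list, window: int = 2) -> list:
    """Return (producer index, consumer index, register) for each read-after-write pair
    in which the producer is among the `window` instructions before the consumer."""
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards


program = [
    Instr("add", "$s0", ("$t0", "$t1")),
    Instr("sub", "$t2", ("$s0", "$t3")),  # needs $s0
]
print(raw_hazards(program))
```

Whether a detected dependence actually costs a cycle depends on the remedy in use; the check only identifies where the needed data is still being generated.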
Example of Data Hazard
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns).]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6
add $s0, $t0, $t1       IF    ID    EX    MEM   WB            Write s0 in CC5
sub $t2, $s0, $t3             IF    ID    EX    MEM   WB      Read s0 and t3 in CC3

• We need to read s0 from reg file in cycle 3,
• but s0 will not be written in reg file until cycle 5.
• However, s0 will only be used in cycle 4,
• and it is available at the end of cycle 3.
Forwarding or Bypassing
• Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction.
• Forwarding can eliminate some, but not all, data hazards.
Forwarding for Data Hazard
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns). The ALU result of add, held in EX/MEM at the end of CC3, is forwarded to the ALU input of sub in CC4.]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6
add $s0, $t0, $t1       IF    ID    EX    MEM   WB            Write s0 in CC5
sub $t2, $s0, $t3             IF    ID    EX    MEM   WB      Read s0 and t3 in CC3
Forwarding Unit Hardware
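[Figure: forwarding unit between ID/EX, EX/MEM and MEM/WB. Inputs to the Forwarding Unit: source reg. IDs from opcode (via ID/EX) and destination registers (from EX/MEM and MEM/WB). Outputs: selects of the two FORW. MUXes at the ALU inputs. A MUX after Data Mem. selects the data to reg. file.]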
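The decision made by the forwarding unit for one ALU input can be sketched as a small function. The conditions below follow the commonly taught rules for the five-stage MIPS pipeline and are an assumption here, since the lecture shows only the block diagram; the function and parameter names are illustrative, and register 0 is excluded because it always reads as zero in MIPS.

```python
def forward_select(src_reg: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> str:
    """Choose the source of one ALU operand: 'EX/MEM', 'MEM/WB' or 'ID/EX' (no forwarding)."""
    # The most recent result wins, so EX/MEM is checked first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "ID/EX"


S0, T3 = 16, 11  # standard MIPS register numbers for $s0 and $t3

# sub $t2, $s0, $t3 is in EX while add $s0, $t0, $t1 is in MEM (result held in EX/MEM).
print("operand $s0 from:", forward_select(S0, True, S0, False, 0))
print("operand $t3 from:", forward_select(T3, True, S0, False, 0))
```

One such selection is needed for each of the two FORW. MUXes, using the two source register IDs of the instruction in EX.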
Forwarding Alone May Not Work
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns).]

Program instructions    CC1   CC2   CC3   CC4   CC5   CC6
lw  $s0, 20($s1)        IF    ID    EX    MEM   WB            Write s0 in CC5
sub $t2, $s0, $t3             IF    ID    EX    MEM   WB      Read s0 and t3 in CC3

• Data needed by sub at the ALU input in CC4 (data hazard)
• Data available from memory only at the end of cycle 4
Use Bubble and Forwarding
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns). A stall (bubble) delays sub by one cycle, so that the loaded value can be forwarded to its ALU input.]

Program instructions    CC1   CC2   CC3   CC4      CC5   CC6   CC7
lw  $s0, 20($s1)        IF    ID    EX    MEM      WB                  Write s0 in CC5
sub $t2, $s0, $t3             IF    ID    stall    EX    MEM   WB
                                          (bubble)
Hazard Detection Unit Hardware
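[Figure: hazard detection unit added to the forwarding hardware. The Hazard Detection Unit examines the instruction in IF/ID (source reg. IDs from opcode) and the instruction in ID/EX; on a hazard it disables writing of PC and IF/ID (Disable write) and switches the NOP MUX so that the control signals passed to ID/EX are 0. The Forwarding Unit and the two FORW. MUXes at the ALU inputs, EX/MEM, Data Mem., MEM/WB and the path to reg. file are as before.]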
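The check performed by a hazard detection unit for the load-use case can be sketched in one expression. The condition is the commonly taught one for the five-stage MIPS pipeline and is an assumption here, since the lecture gives only the block diagram; the function and parameter names are illustrative.

```python
def must_stall(id_ex_memread: bool, id_ex_rt: int, if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID reads the register being loaded by the lw in EX.

    On a stall the PC and IF/ID are not written and zeros (a no-op) are passed
    as control signals into ID/EX, which creates the bubble.
    """
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


S0, S1, T3 = 16, 17, 11  # standard MIPS register numbers

# lw $s0, 20($s1) is in EX (MemRead = 1, rt = $s0); sub $t2, $s0, $t3 is in ID.
print(must_stall(True, S0, if_id_rs=S0, if_id_rt=T3))   # load-use hazard
# An add in EX does not read memory, so forwarding alone is enough.
print(must_stall(False, S0, if_id_rs=S0, if_id_rt=T3))
```

After the one-cycle bubble, the loaded value is available in MEM/WB and the forwarding unit can supply it to the ALU.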
Resolving Hazards
• Hazards are resolved by hazard detection and forwarding units.
• Compiler’s understanding of how these units work can improve performance.
Avoiding Stall by Code Reorder
C code:

    A = B + E;
    C = B + F;

MIPS code:

    lw   $t1, 0($t0)        # $t1 written
    lw   $t2, 4($t0)        # $t2 written
    add  $t3, $t1, $t2      # $t1, $t2 needed
    sw   $t3, 12($t0)
    lw   $t4, 8($t0)        # $t4 written
    add  $t5, $t1, $t4      # $t4 needed
    sw   $t5, 16($t0)

Each add uses a register loaded by the lw immediately before it, so each causes a stall.
Reordered Code
C code:

    A = B + E;
    C = B + F;

MIPS code:

    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    lw   $t4, 8($t0)
    add  $t3, $t1, $t2      # no hazard
    sw   $t3, 12($t0)
    add  $t5, $t1, $t4      # no hazard
    sw   $t5, 16($t0)
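The benefit of reordering can be checked by counting load-use stalls in both instruction orders. The sketch below assumes forwarding is available, so the only remaining stall is an instruction that reads a register loaded by the lw immediately before it. The tuple layout (op, destination, sources) and the function name are choices made for this illustration.

```python
def load_use_stalls(program: list) -> int:
    """Count instructions that read a register loaded by the lw directly before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


# (op, destination register or None, source registers)
original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Same instructions with lw $t4 moved up, ahead of the first add.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print("original: ", load_use_stalls(original), "stalls")
print("reordered:", load_use_stalls(reordered), "stalls")
```

Moving the third load ahead of the first add separates every load from its first use by at least one instruction, so both stalls disappear without adding any instructions.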
Control Hazard
• Instruction to be fetched is not known!
• Example: Instruction being executed is branch-type, which will determine the next instruction:

        add $4, $5, $6
        beq $1, $2, 40
        next instruction
        ...
    40  and $7, $8, $9
Stall on Branch
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns). One stall (bubble) follows beq before the next instruction is fetched.]

Program instructions                     CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
add $4, $5, $6                           IF    ID    EX    MEM   WB
beq $1, $2, 40                                 IF    ID    EX    MEM   WB
Stall (bubble)                                       -     -     -     -     -
next instruction or and $7, $8, $9                         IF    ID    EX    MEM   WB
Why Only One Stall?
• Extra hardware in ID phase:
    • Additional ALU to compute branch address
    • Comparator to generate zero signal
    • Hazard detection unit writes the branch address in PC
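What the extra ID-stage hardware computes can be sketched as follows. This is an illustration under the standard MIPS convention that the 16-bit branch offset counts words relative to PC+4; the function names and the example PC value are hypothetical.

```python
def sign_extend_16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def resolve_beq(pc: int, rs_value: int, rt_value: int, imm16: int) -> int:
    """Return the next PC for beq when the branch is resolved in the ID stage."""
    pc_plus_4 = pc + 4
    target = pc_plus_4 + (sign_extend_16(imm16) << 2)  # additional adder: branch address
    taken = rs_value == rt_value                        # comparator: "zero" signal
    return target if taken else pc_plus_4


# beq $1, $2, 40 at an assumed address 0x100
print(hex(resolve_beq(0x100, rs_value=5, rt_value=5, imm16=40)))  # registers equal: taken
print(hex(resolve_beq(0x100, rs_value=5, rt_value=6, imm16=40)))  # not equal: fall through
```

Because both the target and the comparison are ready at the end of ID, only the single instruction slot behind the branch is affected.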
Ways to Handle Branch
• Stall or bubble
• Branch prediction:
    • Heuristics
        • Next instruction
        • Prediction based on statistics (dynamic)
        • Hardware decision (dynamic)
    • Prediction error: pipeline flush
• Delayed branch
Delayed Branch Example

• Stall on branch

          add $4, $5, $6
          beq $1, $2, skip
          next instruction
          ...
    skip  or $7, $8, $9

• Delayed branch

          beq $1, $2, skip
          add $4, $5, $6       # instruction executed irrespective of branch decision
          next instruction
          ...
    skip  or $7, $8, $9
Delayed Branch
[Pipeline diagram: program instructions (rows) against time in clock cycles (columns). The add fills the slot after beq, so no stall is needed.]

Program instructions                        CC1   CC2   CC3   CC4   CC5   CC6   CC7
beq $1, $2, skip                            IF    ID    EX    MEM   WB
add $4, $5, $6                                    IF    ID    EX    MEM   WB
next instruction or skip or $7, $8, $9                  IF    ID    EX    MEM   WB
Summary: Hazards
• Structural hazards
    • Cause: resource conflict
    • Remedies: (i) hardware resources, (ii) stall (bubble)
• Data hazards
    • Cause: data unavailability
    • Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
• Control hazards
    • Cause: out-of-sequence execution (branch or jump)
    • Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush