Lecture Notes 4: CPU Performance (Pipeline)

Lecture 4: CPU Performance
A Modern Processor
Intel Core i7
Processor Performance
Lower bounds that characterize the maximum performance:

Latency Bound
• Occurs when operations must be performed in strict sequence (e.g. data dependency)
• Minimum time to perform the operations sequentially
Throughput Bound
• Characterizes the raw computing capacity of the processor's functional units
• Maximum operations per cycle
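As a rough illustration of the two bounds, the sketch below estimates the minimum cycle count for a run of operations. The function names and the latency, issue-time, and unit-count values are illustrative assumptions, not figures for any particular processor.

```python
def latency_bound_cycles(n_ops, latency):
    """Cycles needed when every operation depends on the previous one."""
    return n_ops * latency


def throughput_bound_cycles(n_ops, issue_time, units):
    """Cycles needed when the operations are independent.

    Each functional unit can start a new operation every `issue_time`
    cycles, so `units` units sustain units / issue_time operations per cycle.
    """
    ops_per_cycle = units / issue_time
    return n_ops / ops_per_cycle


# Hypothetical multiplier: 3-cycle latency, issue time 1, two units.
print(latency_bound_cycles(1000, latency=3))
print(throughput_bound_cycles(1000, issue_time=1, units=2))
```

With a dependency chain the first bound applies; with independent operations the second one does.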
Pipelining
[Figure: three stages (s1, s2, s3) plotted against time, without pipeline and with pipeline]
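Pipelining
[Figure: stages s1, s2, s3 plotted against time, without pipeline and with pipeline]
s – stages
n – tasks
t – time per stage
Without pipeline: $T_1 = s \cdot t \cdot n$
With pipeline: $T_p = s \cdot t + (n-1) \cdot t$
Speedup $= \frac{T_1}{T_p} = \frac{s \cdot n}{s + (n-1)} = \frac{s}{s/n + (1 - 1/n)}$
For $n \to \infty$: Speedup $= s$
Throughput $= \frac{n}{T_p}$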
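The formulas for T1, Tp, speedup, and throughput can be checked numerically. This is a minimal sketch; the function names are illustrative and the stage count of 5 in the loop is an arbitrary choice.

```python
def pipeline_times(s, n, t):
    """Return (T1, Tp) for n tasks on an s-stage pipeline with stage time t."""
    t1 = s * t * n              # without pipeline: every task passes all s stages alone
    tp = s * t + (n - 1) * t    # with pipeline: fill once, then one task per stage time
    return t1, tp


def speedup(s, n, t=1.0):
    t1, tp = pipeline_times(s, n, t)
    return t1 / tp


def throughput(s, n, t=1.0):
    _, tp = pipeline_times(s, n, t)
    return n / tp


# The speedup grows with the number of tasks and approaches s.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(5, n), 3))
```

For a single task there is no gain; for many tasks the speedup tends to the number of stages.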
Pipelining
[Figure: three stages with unequal delays (s1 = 10, s2 = 30, s3 = 20) plotted against time, without pipeline and with pipeline]
Slowest stage determines the pipeline performance
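To see the effect of the slowest stage, the sketch below uses the stage delays 10, 30, and 20 from the figure. It assumes a clocked pipeline in which all stages advance together, so the cycle time equals the largest stage delay; the function names are illustrative.

```python
def unpipelined_time(stage_delays, n):
    """Each task runs through all stages before the next one starts."""
    return n * sum(stage_delays)


def pipelined_time(stage_delays, n):
    """All stages advance in lockstep, so one cycle lasts as long as the slowest stage."""
    cycle = max(stage_delays)
    return (len(stage_delays) + n - 1) * cycle


delays = [10, 30, 20]
print(unpipelined_time(delays, 100))   # 6000
print(pipelined_time(delays, 100))     # 3060
```

Under this assumption the speedup for 100 tasks is below 2, although the pipeline has three stages.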
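Computational Pipelines
[Figure: unpipelined design, combinational logic followed by a register (Reg) driven by the clock; pipelined design, combinational logic blocks A, B, C, each followed by a register R, all driven by the clock]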
Limitations of Pipelining

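Nonuniform partitioning
• Stage delays may be nonuniform
• Throughput is limited by the slowest stage
[Figure: combinational logic A (50 ps), register R (20 ps), combinational logic B (150 ps), register R (20 ps), combinational logic C (100 ps), register R (20 ps), all driven by the clock]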
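Using the delays from the figure (50, 150, and 100 ps of logic, 20 ps per register), the clock period and throughput can be worked out as follows. This is a sketch assuming one operation completes per clock cycle; the function names are illustrative.

```python
def clock_period_ps(logic_delays_ps, register_delay_ps):
    """The clock must cover the slowest logic block plus one register load."""
    return max(logic_delays_ps) + register_delay_ps


def throughput_gops(period_ps):
    """Operations per second in units of 10^9, one operation per clock cycle."""
    return 1000.0 / period_ps


period = clock_period_ps([50, 150, 100], 20)
print(period)                              # 170
print(round(throughput_gops(period), 2))   # 5.88
print(3 * period)                          # 510, latency in ps through all three stages
```

Stages A and C sit idle for part of every cycle, because the clock is set by stage B.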

Deep pipelining
• Large number of stages
• Modern processors have deep pipelines (15 or more stages) to increase the clock rate.
[Figure: a chain of combinational logic blocks (A, B, …, C) of 50 ps each, each followed by a 20 ps register R, all driven by the clock]
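The register delay limits what deeper pipelines can gain. The sketch below splits a fixed amount of logic evenly over k stages, which is an idealization; the total of 300 ps is an assumed value chosen for illustration, and the function name is hypothetical.

```python
def deep_pipeline(total_logic_ps, register_delay_ps, stages):
    """Return (clock period, latency, throughput in 10^9 ops/s) for an even split."""
    period = total_logic_ps / stages + register_delay_ps
    latency = period * stages
    throughput = 1000.0 / period
    return period, latency, throughput


# Assumed: 300 ps of logic in total, 20 ps per pipeline register.
for k in (1, 3, 6, 15, 30):
    period, latency, throughput = deep_pipeline(300, 20, k)
    print(k, round(period, 1), round(latency, 1), round(throughput, 2))
```

Each doubling of the stage count raises the throughput by less than a factor of two, while the latency keeps growing, because the 20 ps register delay is paid in every stage.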
Pipelined Parallel Adder
[Figure sequence: five frames of a 4-bit adder; a new operand pair enters in each frame and one more bit position of each earlier pair has been added]
Frame 1:  a4,b4  a3,b3  a2,b2  a1,b1
Frame 2:  c4,d4  c3,d3  c2,d2  c1,d1
          a4,b4  a3,b3  a2,b2  a1+b1
Frame 3:  e4,f4  e3,f3  e2,f2  e1,f1
          c4,d4  c3,d3  c2,d2  c1+d1
          a4,b4  a3,b3  a2+b2  a1+b1
Frame 4:  g4,h4  g3,h3  g2,h2  g1,h1
          e4,f4  e3,f3  e2,f2  e1+f1
          c4,d4  c3,d3  c2+d2  c1+d1
          a4,b4  a3+b3  a2+b2  a1+b1
Frame 5:  g4,h4  g3,h3  g2,h2  g1+h1
          e4,f4  e3,f3  e2+f2  e1+f1
          c4,d4  c3+d3  c2+d2  c1+d1
          a4+b4  a3+b3  a2+b2  a1+b1
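The adder frames can be read as an adder pipelined by bit position. The following is a minimal simulation under that reading, assuming stage i adds bit i of one operand pair and passes the carry to the next stage. The function name and the choice to append the final carry as an extra result bit are illustrative.

```python
def pipelined_add(pairs, bits=4):
    """Add operand pairs on a bit-position pipeline; return (sums, cycles)."""
    stages = [None] * bits          # stages[i] holds the job handled by stage i
    pending = list(pairs)
    results = []
    cycles = 0
    while pending or any(job is not None for job in stages):
        cycles += 1
        # Admit the next operand pair and move every job one stage forward.
        new_job = None
        if pending:
            a, b = pending.pop(0)
            new_job = {"a": a, "b": b, "sum": 0, "carry": 0}
        stages = [new_job] + stages[:-1]
        # Every occupied stage adds its own bit position in the same cycle.
        for i, job in enumerate(stages):
            if job is None:
                continue
            total = ((job["a"] >> i) & 1) + ((job["b"] >> i) & 1) + job["carry"]
            job["sum"] |= (total & 1) << i
            job["carry"] = total >> 1
        # The job in the last stage is complete and leaves the pipeline.
        done = stages[-1]
        if done is not None:
            results.append(done["sum"] | (done["carry"] << bits))
            stages[-1] = None
    return results, cycles


sums, cycles = pipelined_add([(3, 5), (9, 8), (15, 1), (7, 7)])
print(sums, cycles)   # [8, 17, 16, 14] 7
```

Four pairs on four stages finish in 4 + (4 - 1) = 7 cycles, in line with Tp = s.t + (n-1).t.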
Instruction Execution Pipeline
1. Instruction fetch cycle (IF)
   • Fetch current instruction from memory
   • Increment PC
2. Instruction decode / register fetch cycle (ID)
   • Decode instruction
   • Compute possible branch target
   • Read registers from the register file
3. Execution / effective address cycle (EX)
   • Form the effective address
   • ALU performs the operation specified by the opcode
4. Memory access (MEM)
   • Memory read for load instruction
   • Memory write for store instruction
5. Write-back cycle (WB)
   • Write result into register file
[Figure: the five pipeline stages in order: IF, ID, EX, MEM, WB]
Instruction Execution Pipeline
[Figure: overlapped execution of successive instructions; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]
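The overlapped execution can be tabulated with a short script. This is a sketch of an ideal pipeline without hazards; the function name and the text layout are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for k, stage in enumerate(stages):
            row[i + k] = stage      # instruction i is in stage k during cycle i + k
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(4)):
    print(f"I{i + 1}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each instruction starts one cycle after its predecessor, so four instructions finish in 5 + 3 = 8 cycles.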
Pipeline Hazards
1. Structural hazards
2. Data hazards
3. Control hazards
Pipeline Hazards
Structural Hazards

Arise from resource conflicts when the hardware cannot support all
possible combinations of instructions simultaneously in overlapped
execution.
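[Figure: pipeline diagram with a stall (bubble); each instruction passes through IF, ID, EX, MEM, WB (resources Mem, Reg, ALU, Mem, Reg); vertical axis: stages, horizontal axis: time]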
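One common source of a structural hazard is a single memory port shared by instruction fetch (IF) and data access (MEM). The sketch below assumes exactly that setup, which is an illustrative assumption and not taken from the figure; the function name is hypothetical.

```python
def fetch_cycles(is_mem_op):
    """Return the cycle in which each instruction is fetched.

    is_mem_op[i] is True if instruction i is a load or store. The single
    memory port is assumed to serve either one IF or one MEM per cycle,
    with MEM taking priority.
    """
    busy = set()        # cycles in which the port serves a MEM stage
    fetch = []
    cycle = 0
    for mem in is_mem_op:
        while cycle in busy:
            cycle += 1              # stall (bubble): port taken by an older load/store
        fetch.append(cycle)
        if mem:
            busy.add(cycle + 3)     # MEM comes three stages after IF
        cycle += 1
    return fetch


# A load followed by four ALU instructions: the fourth instruction is delayed by one cycle.
print(fetch_cycles([True, False, False, False, False]))   # [0, 1, 2, 4, 5]
```

Separate instruction and data memories (or caches) would remove this particular conflict.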
Pipeline Hazards
Data Hazards

Arise when an instruction depends on the results of a previous
instruction in a way that is exposed by the overlapping of instructions.
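ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11
[Figure: pipeline diagram of the five instructions; each passes through IF, ID, EX, MEM, WB (resources Mem, Reg, ALU, Mem, Reg); vertical axis: stages, horizontal axis: time]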
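For the instruction sequence above, the dependences on R1 can be found mechanically. The sketch assumes the classic five-stage timing (registers read in ID, written in WB) and, optionally, a register file that writes in the first half of a cycle and reads in the second half. Both are assumptions, and the function names are illustrative.

```python
PROGRAM = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]


def raw_dependences(program):
    """Return (producer, consumer, register) triples for read-after-write pairs."""
    deps = []
    for j, (_, _, *sources) in enumerate(program):
        for src in sources:
            # Look backwards for the most recent instruction that writes src.
            for i in range(j - 1, -1, -1):
                if program[i][1] == src:
                    deps.append((i, j, src))
                    break
    return deps


def is_hazard(i, j, split_cycle_regfile=True):
    """True if consumer j would read the register before producer i has written it."""
    write_cycle = i + 4     # producer reaches WB four cycles after its IF
    read_cycle = j + 1      # consumer reaches ID one cycle after its IF
    if split_cycle_regfile:
        return read_cycle < write_cycle
    return read_cycle <= write_cycle


for i, j, reg in raw_dependences(PROGRAM):
    print(PROGRAM[j][0], "reads", reg, "hazard:", is_hazard(i, j))
```

Under these assumptions SUB and AND read a stale R1, while OR and XOR do not.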
Pipeline Hazards
Data Hazards

Forwarding (by-passing)
[Figure: four overlapped instructions, each passing through IF, ID, EX, MEM, WB (resources Mem, Reg, ALU, Mem, Reg), illustrating forwarding]
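The benefit of forwarding can be expressed as the number of stall cycles a dependent instruction needs. The sketch assumes the classic five-stage timing and a register file that can be written and read in the same cycle. The load case is added for comparison and is not shown on the slide; the function name is illustrative.

```python
def stall_cycles(distance, forwarding, producer_is_load=False):
    """Bubbles needed when a consumer follows its producer by `distance` instructions."""
    if forwarding:
        # An ALU result is forwarded from the end of EX, loaded data from the end of MEM.
        ready_after = 2 if producer_is_load else 1
    else:
        # The consumer must read the register file in the producer's WB cycle or later.
        ready_after = 3
    return max(0, ready_after - distance)


print(stall_cycles(1, forwarding=False))                         # 2
print(stall_cycles(1, forwarding=True))                          # 0
print(stall_cycles(1, forwarding=True, producer_is_load=True))   # 1
```

With forwarding, back-to-back ALU instructions run without stalls under these assumptions.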
Pipeline Hazards
Control (Branch) Hazards

Arise from the pipelining of instructions (e.g. branches) that change the PC.
for i = n to 1
    c[i] = a[i] + b[i]

LOOP: LOAD 100,X
      ADD 200,X
      STORE 300,X
      DECX
      BNE LOOP
      ...
[Figure: pipeline diagram of the loop; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]
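To see how often the branch changes the PC, the loop can be rendered in Python. This is an interpretation of the assembly, assuming that 100, 200, and 300 are the base addresses of a, b, and c, that X is an index register, and that BNE branches while X is not zero after DECX. The function name is illustrative.

```python
def run_loop(a, b):
    """Compute c[i] = a[i] + b[i] for i = n down to 1 and count branch behaviour."""
    n = len(a)
    if n == 0:
        return [], 0, 0
    c = [0] * n
    x = n                                   # index register X
    instructions = taken = 0
    while True:
        c[x - 1] = a[x - 1] + b[x - 1]      # LOAD 100,X / ADD 200,X / STORE 300,X
        x -= 1                              # DECX
        instructions += 5                   # LOAD, ADD, STORE, DECX, BNE
        if x == 0:                          # BNE falls through when X reaches zero
            break
        taken += 1                          # BNE LOOP taken: the PC changes
    return c, instructions, taken


print(run_loop([1, 2, 3], [10, 20, 30]))    # ([11, 22, 33], 15, 2)
```

For n iterations the branch is taken n - 1 times, so every fifth instruction is a branch and almost every one of them redirects the fetch stage.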
Pipeline Hazards
Control (Branch) Hazards

Freeze (flush)
      BRA L1
      ...
L1:   NEXT
      NEXT
      NEXT
      NEXT
[Figure: pipeline diagram; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]
Pipeline Hazards
Control (Branch) Hazards

Predicted-not-taken
      BNE L1
      NEXT
      NEXT
      NEXT
      ...
L1:   NEXT
      NEXT
      NEXT
[Figure: pipeline diagrams for the not-taken and taken cases; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]
Pipeline Hazards
Control (Branch) Hazards

Predicted-taken
      BNE L1
      NEXT
      NEXT
      NEXT
      ...
L1:   NEXT
      NEXT
      NEXT
[Figure: pipeline diagrams for the not-taken and taken cases; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]
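The freeze, predicted-not-taken, and predicted-taken schemes can be compared by their effect on the average cycles per instruction. The penalties below assume a one-cycle branch delay and, for predicted-taken, that the target address is available early enough to avoid a penalty. These values, the branch frequency, and the taken fraction are illustrative assumptions.

```python
def cpi_with_branches(branch_freq, taken_frac, penalty_taken, penalty_not_taken,
                      base_cpi=1.0):
    """Average CPI when each branch costs a penalty that depends on its outcome."""
    stall = branch_freq * (taken_frac * penalty_taken
                           + (1 - taken_frac) * penalty_not_taken)
    return base_cpi + stall


# Assumed workload: 20% branches, 60% of them taken.
FREQ, TAKEN = 0.2, 0.6
schemes = {
    "freeze (flush)":      (1, 1),   # always lose one cycle
    "predicted-not-taken": (1, 0),   # lose a cycle only when the branch is taken
    "predicted-taken":     (0, 1),   # lose a cycle only when the branch is not taken
}
for name, (p_taken, p_not_taken) in schemes.items():
    print(name, round(cpi_with_branches(FREQ, TAKEN, p_taken, p_not_taken), 3))
```

Which prediction wins depends on the taken fraction: the loop-closing branch of a long loop is almost always taken.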
Pipeline Hazards
Control (Branch) Hazards

Delayed branch
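branch instruction
sequential successor
branch target if taken

Before scheduling the delay slot:
      ADD R1,R2,R3
      if (R2=0) branch L1
      delay slot
      NEXT
      NEXT
      ...
L1:   NEXT
      NEXT
      NEXT

After scheduling (ADD fills the delay slot):
      if (R2=0) branch L1
      ADD R1,R2,R3
      NEXT
      NEXT
      ...
L1:   NEXT
      NEXT
      NEXT
[Figure: pipeline diagrams for the not-taken and taken cases; vertical axis: stages (IF, ID, EX, MEM, WB), horizontal axis: time]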
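Filling the delay slot from before the branch is legal only if the branch does not depend on the moved instruction. A minimal sketch of that check follows. The tuple layout (op, dest, src1, src2) and the function name are illustrative, and only the instruction directly before the branch is considered.

```python
NOP = ("NOP",)


def fill_delay_slot(before, branch_reads):
    """Try to move the instruction preceding the branch into the delay slot.

    before:        instructions ahead of the branch, as (op, dest, src1, src2)
    branch_reads:  set of registers the branch condition reads
    Returns (remaining instructions before the branch, delay-slot instruction).
    """
    if before and before[-1][1] not in branch_reads:
        return before[:-1], before[-1]
    return before, NOP      # nothing safe to move: the slot holds a no-op


# ADD writes R1, the branch tests R2, so ADD can move into the delay slot.
print(fill_delay_slot([("ADD", "R1", "R2", "R3")], {"R2"}))
```

If the ADD wrote R2 instead, the branch would need its result and the slot would stay empty.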
Levels of Parallelism

Bit level parallelism
• Within arithmetic logic circuits
Instruction level parallelism
• Multiple instructions execute per clock cycle
Memory system parallelism
• Overlap of memory operations with computation
Operating system parallelism
• More than one processor
• Multiple jobs run in parallel on SMP
• Loop level
• Procedure level