Systems I
Pipelining I

Topics
- Pipelining principles
- Pipeline overheads
- Pipeline registers and stages

Overview
What’s wrong with the sequential (SEQ) Y86?
- It’s slow!
- Each piece of hardware is used only a small fraction of the time
- We would like to find a way to get more performance with only a little more hardware
General Principles of Pipelining
- Goal
- Difficulties
Creating a Pipelined Y86 Processor
- Rearranging SEQ
- Inserting pipeline registers
- Problems with data and control hazards

Real-World Pipelines: Car Washes
[Figure: three car-wash layouts: sequential, parallel, and pipelined]
Idea
- Divide process into independent stages
- Move objects through stages in sequence
- At any given time, multiple objects are being processed

Laundry example
- Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away
- Washer takes 30 minutes
- Dryer takes 30 minutes
- “Folder” takes 30 minutes
- “Stasher” takes 30 minutes to put clothes into drawers
Slide courtesy of D. Patterson

Sequential Laundry
[Figure: task order (A, B, C, D) versus time, 6 PM to 2 AM; the loads run back to back, each occupying four 30-minute steps]
- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would laundry take?
Slide courtesy of D. Patterson

Pipelined Laundry: Start ASAP
[Figure: task order (A, B, C, D) versus time from 6 PM; each load starts 30 minutes after the previous one, for seven 30-minute slots in total]
- Pipelined laundry takes 3.5 hours for 4 loads!
Slide courtesy of D. Patterson

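The two laundry totals can be checked with a short calculation. This is a minimal sketch that assumes every stage takes the same time and that a new load starts as soon as the first stage is free; the function names are illustrative, not from the slides.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes all of its stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load takes `stages` slots; every later load adds one slot."""
    if loads == 0:
        return 0
    return (stages + loads - 1) * stage_minutes


# Four loads, four 30-minute stages (wash, dry, fold, stash).
print(sequential_minutes(4, 4, 30) / 60)  # 8.0 hours
print(pipelined_minutes(4, 4, 30) / 60)   # 3.5 hours
```

The pipelined total is not 8 / 4 = 2 hours because of the time spent filling and draining the pipeline.
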
Pipelining Lessons
[Figure: the pipelined laundry timeline, task order (A, B, C, D) versus time from 6 PM in 30-minute slots]
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
- Stall for dependences
Slide courtesy of D. Patterson

Latency and Throughput
Latency: time to complete an operation
Throughput: work completed per unit time
Consider plumbing
- Low latency: turn on faucet and water comes out
- High bandwidth: lots of water (e.g., to fill a pool)
What is “High speed Internet?”
- Low latency: needed for interactive gaming
- High bandwidth: needed for downloading large files
- Marketing departments like to conflate latency and bandwidth…

Relationship between Latency and Throughput
Latency and bandwidth are only loosely coupled
- Henry Ford: assembly lines increase bandwidth without reducing latency
My factory takes 1 day to make a Model T Ford.
- But I can start building a new car every 10 minutes
- At 24 hrs/day, I can make 24 * 6 = 144 cars per day
- A special order for 1 green car still takes 1 day
- Throughput is increased, but latency is not.
Latency reduction is difficult
Often, one can buy bandwidth
- E.g., more memory chips, more disks, more computers
- Big server farms (e.g., Google) are high bandwidth

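The factory arithmetic separates the two quantities cleanly: throughput depends only on how often a new car can be started, while latency is the time one car spends on the line. A small sketch, with illustrative names that are not from the slides:

```python
MINUTES_PER_DAY = 24 * 60


def cars_per_day(start_interval_minutes: int) -> int:
    """Throughput: one car is started (and one finished) per interval."""
    return MINUTES_PER_DAY // start_interval_minutes


LATENCY_DAYS = 1  # time for any single car, unchanged by the assembly line

print(cars_per_day(10))  # 144
print(LATENCY_DAYS)      # 1
```

Halving the start interval would double the throughput but leave the one-day latency of the special-order green car untouched.
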
Computational Example
[Figure: combinational logic (300 ps) feeding a register (20 ps) driven by a clock. Delay = 320 ps, Throughput = 3.12 GOPS]
System
- Computation requires a total of 300 picoseconds
- Additional 20 picoseconds to save the result in the register
- Must have a clock cycle of at least 320 ps

3-Way Pipelined Version
[Figure: combinational logic blocks A, B, and C (100 ps each), each followed by a register (20 ps), all driven by one clock. Delay = 360 ps, Throughput = 8.33 GOPS]
System
- Divide combinational logic into 3 blocks of 100 ps each
- Can begin a new operation as soon as the previous one passes through stage A
  - Begin a new operation every 120 ps
- Overall latency increases
  - 360 ps from start to finish

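The delay and throughput figures for the unpipelined and 3-way pipelined systems can be reproduced from the stage delays. The sketch below assumes one register per stage and a clock period set by the slowest stage plus the register delay; `pipeline_timing` is an illustrative helper, not something defined in the slides.

```python
def pipeline_timing(stage_delays_ps, register_delay_ps=20):
    """Return (clock period in ps, latency in ps, throughput in GOPS)."""
    cycle_ps = max(stage_delays_ps) + register_delay_ps
    latency_ps = cycle_ps * len(stage_delays_ps)
    # One operation completes per cycle; 1 / (1 ps) = 1000 GOPS.
    throughput_gops = 1000.0 / cycle_ps
    return cycle_ps, latency_ps, throughput_gops


print(pipeline_timing([300]))            # (320, 320, 3.125)
cycle, latency, gops = pipeline_timing([100, 100, 100])
print(cycle, latency, round(gops, 2))    # 120 360 8.33
```

The unpipelined value is exactly 3.125 GOPS, which the slide shows truncated to 3.12. Throughput rises by a factor of about 2.67 rather than 3 because each stage pays the 20 ps register delay.
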
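Pipeline Diagrams
Unpipelined
[Figure: OP1, OP2, and OP3 laid end to end along the time axis]
- Cannot start a new operation until the previous one completes
3-Way Pipelined
[Figure: OP1, OP2, and OP3 each pass through stages A, B, and C, with each operation starting one stage later than the previous one]
- Up to 3 operations in process simultaneously
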
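A pipeline diagram of this kind is easy to generate as text, which can help when checking how many operations overlap in a given cycle. This is a minimal sketch with an illustrative function name; each column is one clock cycle and `.` marks a cycle in which the operation is not in the pipeline.

```python
def pipeline_diagram(num_ops, stages=("A", "B", "C")):
    """Return one text row per operation; operation k enters in cycle k."""
    rows = []
    for op in range(num_ops):
        cells = ["."] * op + list(stages) + ["."] * (num_ops - 1 - op)
        rows.append(f"OP{op + 1} " + " ".join(cells))
    return rows


for row in pipeline_diagram(3):
    print(row)
# OP1 A B C . .
# OP2 . A B C .
# OP3 . . A B C
```

Reading down the third column shows OP1 in stage C, OP2 in stage B, and OP3 in stage A at the same time, which is the "up to 3 operations in process" case.
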
Operating a Pipeline
[Figure: timing diagram of OP1, OP2, and OP3 moving through stages A, B, and C against a time axis in ps, with snapshots of the three-stage circuit (100 ps logic blocks, 20 ps registers, shared clock) at times 239, 241, 300, and 359 ps]

Limitations: Nonuniform Delays
[Figure: combinational logic blocks A (50 ps), B (150 ps), and C (100 ps), each followed by a register (20 ps), with a pipeline diagram of OP1, OP2, and OP3. Delay = 510 ps, Throughput = 5.88 GOPS]
- Throughput limited by slowest stage
- Other stages sit idle for much of the time
- Challenging to partition system into balanced stages

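The cost of unbalanced stages can be made concrete by computing how much of each clock cycle a stage's logic is actually busy. The sketch assumes the clock period is the slowest stage delay plus the register delay; the function name and the definition of "busy fraction" are illustrative choices, not from the slides.

```python
def stage_busy_fractions(stage_delays_ps, register_delay_ps=20):
    """Return (clock period in ps, fraction of each cycle each stage computes)."""
    cycle_ps = max(stage_delays_ps) + register_delay_ps
    fractions = [delay / cycle_ps for delay in stage_delays_ps]
    return cycle_ps, fractions


cycle, busy = stage_busy_fractions([50, 150, 100])
print(cycle)                          # 170
print(3 * cycle)                      # 510 ps total delay
print(round(1000.0 / cycle, 2))       # 5.88 GOPS
print([round(f, 2) for f in busy])    # [0.29, 0.88, 0.59]
```

Stage A computes for under a third of every cycle, which is the idle time the slide refers to.
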
Limitations: Register Overhead
[Figure: six combinational logic blocks (50 ps each), each followed by a register (20 ps), all driven by one clock. Delay = 420 ps, Throughput = 14.29 GOPS]
- As we try to deepen the pipeline, the overhead of loading registers becomes more significant
- Percentage of clock cycle spent loading register:
  - 1-stage pipeline: 6.25%
  - 3-stage pipeline: 16.67%
  - 6-stage pipeline: 28.57%
- High speeds of modern processor designs are obtained through very deep pipelining

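The three percentages follow from dividing the 20 ps register delay by the clock period at each depth. A short sketch, assuming the 300 ps of logic splits evenly across the stages (the function name is illustrative):

```python
def register_overhead_percent(total_logic_ps, stages, register_delay_ps=20):
    """Share of each clock cycle spent loading the pipeline register."""
    cycle_ps = total_logic_ps / stages + register_delay_ps
    return 100.0 * register_delay_ps / cycle_ps


for depth in (1, 3, 6):
    print(depth, round(register_overhead_percent(300, depth), 2))
# 1 6.25
# 3 16.67
# 6 28.57
```
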
CPU Performance Equation
3 components to execution time:
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Factors affecting CPU execution time:
[Table: factors (Program, Compiler, Inst. Set, Organization, MicroArch, Technology) marked against Inst. Count (X, X, X), CPI ((X), X, X, X), and Clock Rate ((X), X, X, X)]
- Consider all three elements when optimizing
- Workloads change!

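The equation translates directly into code. In the sketch below the clock rate in Hz stands in for the reciprocal of seconds per cycle, and the sample numbers are hypothetical values chosen only to make the units easy to follow.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: 1 billion instructions, CPI of 1.5, 1 GHz clock.
print(cpu_time_seconds(1_000_000_000, 1.5, 1_000_000_000))  # 1.5
```

Because the three terms multiply, an optimization that lowers one term while raising another (for example, fewer but slower instructions) may not help.
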
Cycles Per Instruction (CPI)
Depends on the instruction
\[ \mathrm{CPI}_i = \text{Execution time of instruction } i \times \text{Clock Rate} \]
Average cycles per instruction
\[ \mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where } F_i = \frac{IC_i}{IC_{tot}} \]
Example:
Op          Freq  Cycles  CPI(i)  %time
ALU         50%   1       0.5     33%
Load        20%   2       0.4     27%
Store       10%   2       0.2     13%
Branch      20%   2       0.4     27%
CPI(total)                1.5

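The example table can be recomputed from the frequencies and cycle counts alone. This sketch uses the instruction mix from the table; the function names are illustrative.

```python
# op -> (frequency F_i, cycles CPI_i), taken from the example table
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Weighted sum of per-instruction CPI by instruction frequency."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share_percent(mix):
    """Percentage of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: 100.0 * freq * cycles / total for op, (freq, cycles) in mix.items()}


print(round(average_cpi(MIX), 2))
print({op: round(pct) for op, pct in time_share_percent(MIX).items()})
```

ALU instructions are half of the instruction count but only about a third of the time, because each takes one cycle while the others take two.
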
Comparing and Summarizing Performance
Fair way to summarize performance?
Capture in a single number?
Example: Which of the following machines is best?
            Computer A  Computer B  Computer C
Program 1   1           10          20
Program 2   1000        100         20
Total Time  1001        110         40

Means
Arithmetic mean
\[ \frac{1}{n} \sum_{i=1}^{n} T_i \]
- Can be weighted: \( \sum_{i} a_i T_i \)
- Represents total execution time
- Should not be used for aggregating normalized numbers
Geometric mean
\[ \left( \prod_{i=1}^{n} T_i \right)^{1/n} \]
- Consistent independent of reference
- Best for combining results
- Best for normalized results
- \( \ln(\mathrm{Geo}) = \frac{1}{n} \sum_{i=1}^{n} \ln(T_i) \)

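The claim that the geometric mean is consistent regardless of the reference machine can be checked with the Computer A, B, and C times from the comparison example. The sketch below normalizes every machine to each reference in turn and compares the two means; the helper names are illustrative.

```python
import math

# Execution times for Program 1 and Program 2 on each computer.
TIMES = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Computed through logarithms: ln(Geo) = (1/n) * sum(ln(T_i)).
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized(times, reference):
    """Divide each machine's times by the reference machine's times."""
    ref = times[reference]
    return {m: [t / r for t, r in zip(ts, ref)] for m, ts in times.items()}


for reference in TIMES:
    norm = normalized(TIMES, reference)
    arith = {m: round(arithmetic_mean(ts), 3) for m, ts in norm.items()}
    geo = {m: round(geometric_mean(ts), 3) for m, ts in norm.items()}
    print(reference, "arithmetic:", arith, "geometric:", geo)
```

With the arithmetic mean of normalized times, whichever machine is chosen as the reference comes out best. With the geometric mean, the ratio between any two machines stays the same for every reference.
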
What is the geometric mean of 2 and 8?
A. 5
B. 4

Is Speed the Last Word in Performance?
Depends on the application!
Cost
- Not just the processor, but other components (e.g., memory)
Power consumption
- Trade power for performance in many applications
Capacity
- Many database applications are I/O bound, and disk bandwidth is the precious commodity

Revisiting the Performance Eqn
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Instruction Count: No change
Clock Cycle Time
- Improves by a factor of almost N for an N-deep pipeline
- Not quite a factor of N due to pipeline overheads
Cycles Per Instruction
- In an ideal world, CPI would stay the same
- An individual instruction takes N cycles
- But we have N instructions in flight at a time
- So average CPIpipe = CPIno_pipe * N/N
- Thus performance can improve by up to a factor of N

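The "almost N" claim can be quantified with the numbers used earlier in the lecture (300 ps of logic, 20 ps per register). This sketch assumes perfectly balanced stages, an unchanged instruction count, and an unchanged CPI, so the speedup is just the ratio of clock periods; the function name is illustrative.

```python
def pipeline_speedup(logic_ps, stages, register_delay_ps=20):
    """Ratio of unpipelined to pipelined clock period for balanced stages."""
    unpipelined_cycle = logic_ps + register_delay_ps
    pipelined_cycle = logic_ps / stages + register_delay_ps
    return unpipelined_cycle / pipelined_cycle


for depth in (3, 6):
    print(depth, round(pipeline_speedup(300, depth), 2))
# 3 2.67
# 6 4.57
```

The gap between the speedup and N widens as the pipeline deepens, since the fixed register delay makes up a growing share of each cycle.
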
Data Dependencies
    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx
[Figure: OP1, OP2, and OP3 along the time axis]
- Result from one instruction used as operand for another
  - Read-after-write (RAW) dependency
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Get correct results
  - Minimize performance impact

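RAW dependencies in a short instruction sequence can be found mechanically by tracking the most recent writer of each register. In the sketch below the read and write sets are hand-annotated for the three instructions on the slide (the annotations assume `addl` reads both operands and writes the second, and `mrmovl` reads its base register); nothing here parses real Y86 assembly.

```python
# (text, registers written, registers read), annotated by hand
PROGRAM = [
    ("irmovl $50, %eax", {"%eax"}, set()),
    ("addl %eax, %ebx", {"%ebx"}, {"%eax", "%ebx"}),
    ("mrmovl 100(%ebx), %edx", {"%edx"}, {"%ebx"}),
]


def raw_dependencies(program):
    """Return (producer index, consumer index, register) triples."""
    deps = []
    last_writer = {}
    for index, (_, writes, reads) in enumerate(program):
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        for reg in writes:
            last_writer[reg] = index
    return deps


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{PROGRAM[consumer][0]!r} reads {reg} written by {PROGRAM[producer][0]!r}")
```

Each instruction here depends on the one immediately before it, which is the hardest case for a pipeline because the producer has not finished when the consumer needs its operand.
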
Data Hazards
[Figure: three-stage pipeline (combinational logic A, B, C, each followed by a register) with its output fed back to the input, and a pipeline diagram of OP1 through OP4]
- Result does not feed back around in time for next operation
- Pipelining has changed behavior of system

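The change in behavior can be shown with a toy model in which every operation takes the previous result as its input. This is a sketch under simplifying assumptions: one operation is issued per cycle, a result becomes visible `depth` cycles after its operation is issued, and an operation simply uses the newest visible result (no stalling or forwarding). The function names are illustrative.

```python
def run_sequential(f, initial, count):
    """Each operation waits for the previous result, as in the unpipelined system."""
    results = []
    value = initial
    for _ in range(count):
        value = f(value)
        results.append(value)
    return results


def run_pipelined_naive(f, initial, count, depth):
    """Operation k is issued in cycle k and only sees results from operation k - depth."""
    results = []
    for k in range(count):
        source = results[k - depth] if k >= depth else initial
        results.append(f(source))
    return results


increment = lambda x: x + 1
print(run_sequential(increment, 0, 6))          # [1, 2, 3, 4, 5, 6]
print(run_pipelined_naive(increment, 0, 6, 3))  # [1, 1, 1, 2, 2, 2]
```

The pipelined version computes different values because each operation reads a stale input, which is exactly the hazard the slide describes.
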
Data Dependencies in Processors
    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx
- Result from one instruction used as operand for another
  - Read-after-write (RAW) dependency
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Get correct results
  - Minimize performance impact

SEQ Hardware
- Stages occur in sequence
- One operation in process at a time
- One stage for each logical pipeline operation
  - Fetch (get next instruction from memory)
  - Decode (figure out what instruction does and get values from regfile)
  - Execute (compute)
  - Memory (access data memory if necessary)
  - Write back (write any instruction result to regfile)
[Figure: SEQ datapath, bottom to top: Fetch (PC, instruction memory, PC increment; icode, ifun, rA, rB, valC, valP), Decode and Write back (register file; srcA, srcB, dstE, dstM, valA, valB), Execute (ALU, CC; Bch, valE), Memory (data memory; valM), and New PC (newPC)]

SEQ+ Hardware
- Still sequential implementation
- Reorder PC stage to put at beginning
PC Stage
- Task is to select PC for current instruction
- Based on results computed by previous instruction
Processor State
- PC is no longer stored in register
- But, can determine PC based on other stored information
[Figure: SEQ+ datapath, with the PC stage at the bottom selecting the PC from pIcode, pBch, pValM, pValC, and pValP, followed by Fetch, Decode and Write back, Execute, and Memory]

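The PC stage can be pictured as a small selection function over the values saved from the previous instruction. The slide names the inputs (pIcode, pBch, pValM, pValC, pValP) but does not spell out the selection rule, so the rule below is an assumption based on the usual Y86 convention: a call or a taken jump continues at the constant from the instruction, a return continues at the value read from memory, and anything else falls through to the incremented PC. The string instruction codes are placeholders, not the encoding used by the real hardware.

```python
def select_pc(p_icode, p_bch, p_val_m, p_val_c, p_val_p):
    """Choose the current PC from the previous instruction's saved results."""
    if p_icode == "call":
        return p_val_c            # call target is the instruction constant
    if p_icode == "jXX" and p_bch:
        return p_val_c            # taken branch goes to the jump target
    if p_icode == "ret":
        return p_val_m            # return address was read from the stack
    return p_val_p                # default: next sequential instruction


# Hypothetical values: previous instruction was a taken jump to 0x40.
print(hex(select_pc("jXX", True, 0x0, 0x40, 0x15)))   # 0x40
# The same jump, not taken, falls through to the incremented PC.
print(hex(select_pc("jXX", False, 0x0, 0x40, 0x15)))  # 0x15
```

Because the PC is recomputed from these saved values at the start of each instruction, no separate PC register is needed, which matches the "PC is no longer stored in register" point.
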
Adding Pipeline Registers
[Figure: the SEQ+ datapath shown beside the pipelined datapath, in which pipeline registers F (predPC), D (icode, ifun, rA, rB, valC, valP), E, M (M_icode, M_Bch, M_valA), and W (W_icode, W_valE, W_valM, W_dstE, W_dstM) are inserted between the Fetch, Decode, Execute, Memory, and Write back stages]

Pipeline Stages
Fetch
- Select current PC
- Read instruction
- Compute incremented PC
Decode
- Read program registers
Execute
- Operate ALU
Memory
- Read or write data memory
Write Back
- Update register file
[Figure: pipelined datapath with pipeline registers F, D, E, M, and W]

Summary
Today
- Pipelining principles (assembly line)
- Overheads due to imperfect pipelining
- Breaking instruction execution into sequence of stages
Next Time
- Pipelining hardware: registers and feedback paths
- Difficulties with pipelines: hazards
- Method of mitigating hazards