It’s all about latency
Henk Neefs
Dept. of Electronics and
Information Systems (ELIS)
University of Gent
Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions
Out-of-order processor pipeline
[Figure: out-of-order pipeline. Blocks shown: I-cache, fetch, decode, rename, instruction window, execution units (LD, ST, INT), ‘future’ register file, in-order retirement, architectural register file]
Branch latency
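[Figure: the same pipeline (I-cache, fetch, decode, rename, instruction window, execution units LD, ST, BR, ‘future’ register file) with the instruction stream (ST, XOR, LD, OR, BR, ADD, INT, ...) drawn along a time axis; the interval marked "latency" belongs to the branch (BR)]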
Eliminate branch latency
• By prediction:
predict outcome of branch
=> eliminate dependency (with a high probability)
• By predication:
convert control dependency to data dependency
=> eliminate control dependency
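The conversion from a control dependency to a data dependency can be illustrated in software. This is only a rough sketch: real predication is done by the instruction set and the hardware, and the function names below are hypothetical, not taken from the talk.

```python
def abs_diff_branch(a, b):
    # Control dependency: which subtraction executes depends on the branch outcome.
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    # The condition becomes an ordinary data value (predicate p is 1 or 0).
    p = int(a > b)
    # Both candidate results are computed unconditionally ...
    if_true = a - b
    if_false = b - a
    # ... and the predicate selects one of them: a data dependency on p,
    # with no branch left to wait for.
    return p * if_true + (1 - p) * if_false


for a, b in [(3, 10), (10, 3), (5, 5)]:
    print(a, b, abs_diff_branch(a, b), abs_diff_predicated(a, b))
```

Both versions return the same value; the predicated one trades extra computation for the absence of a branch.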
Load latency
while (pointer!=0)
    pointer = pointer.next;

Loop:
    LD R1, R1(32)
    BNE R1, Loop

load latency = 2 cycles
branch latency = 1 cycle
CPI = 2 cycles/2 instructions
= 1 cycle/instruction
[Figure: schedule of the successive LD and BNE instructions on the execution units, along a cycles axis]
With longer load latency
• When L1-cache misses and L2-cache hits:
load latency = 2+6 cycles
branch latency = 1 cycle
CPI = 8 cycles/2 instructions
= 4 cycles/instruction
• When L2-cache misses and main memory hits:
load latency = 2+6+60 cycles
CPI = 68 cycles/2 instructions
= 34 cycles/instruction
[Figure: schedule of the successive LD and BNE instructions on the execution units, along a cycles axis]
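The CPI figures for the pointer-chasing loop can be reproduced with a few lines of arithmetic. The sketch assumes, as the slide numbers imply, that each load must wait for the previous load and that the 1-cycle branch overlaps with the next load; the function name is hypothetical.

```python
def pointer_chase_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of the LD/BNE loop when every LD depends on the previous LD."""
    # A new iteration can start only every `load_latency_cycles` cycles,
    # because the next load needs the pointer produced by the current one.
    # Assumption: the branch is predicted, so its latency is hidden.
    return load_latency_cycles / instructions_per_iteration


print(pointer_chase_cpi(2))            # L1-cache hit            -> 1.0
print(pointer_chase_cpi(2 + 6))        # L1 miss, L2 hit         -> 4.0
print(pointer_chase_cpi(2 + 6 + 60))   # L2 miss, main memory    -> 34.0
```

In this loop the CPI is simply proportional to the load latency, which is why it is so sensitive to cache misses.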
Memory hierarchy
[Figure: memory hierarchy, from the execution units through register file, L1 cache, L2 cache and main memory to hard drive; storage capacity and latency increase along the hierarchy]
L1 cache latency
[Figure: IPC (axis 0 to 12) versus instruction window size (0 to 300 instructions), with curves for latency = 2, latency = 3 and latency = 4]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Main memory latency
[Figure: IPC (axis 3 to 3.6) versus main memory latency (0 to 100 ns)]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Performance and latency
Interconnect type               Sensitivity of performance to latency decrease (% per ns)
Processor core/register file    39
Processor/L1-cache              19
L1-cache/L2-cache               3.0
L2-cache/main memory            0.18
performance change = sensitivity * load latency change
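As a worked example of this formula, the sketch below applies the sensitivities from the table to a given latency change. The dictionary and function names are hypothetical, and the linear relation is presumably only a good approximation for small latency changes.

```python
# Sensitivities from the table, in % performance per ns of latency decrease.
SENSITIVITY_PCT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_pct(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change."""
    # A negative latency_decrease_ns models added latency (a slower link),
    # which yields a negative performance change.
    return SENSITIVITY_PCT_PER_NS[interconnect] * latency_decrease_ns


# Illustrative inputs only: the same 1 ns decrease at every level.
for name in SENSITIVITY_PCT_PER_NS:
    print(f"{name}: {performance_change_pct(name, 1.0):+.2f} %")
```

A 10 ns decrease in main memory latency gives about 1.8 % more performance, while 1 ns less between the processor and the L1-cache already gives about 19 %.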
Increase performance by
• eliminating/reducing load latency:
– By prefetching:
predict the next miss and fetch the data
to e.g. L1-cache
– By address prediction:
address known earlier
=> load executed earlier
=> data early in register file
• or reducing sensitivity to load latency:
– by fine-grain multithreading
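Why multithreading reduces the sensitivity to load latency can be seen with a toy model, which is not taken from the talk. It assumes every thread runs the two-instruction pointer-chasing loop shown earlier, that threads are independent, and that the processor has a hypothetical issue width of 4 instructions per cycle.

```python
def aggregate_ipc(num_threads, load_latency_cycles,
                  issue_width=4, instructions_per_iteration=2):
    """Toy model: IPC of interleaved, independent pointer-chasing threads."""
    # A single thread completes one iteration per load latency.
    per_thread_ipc = instructions_per_iteration / load_latency_cycles
    # Other threads fill the cycles in which a thread waits for its load,
    # until the issue width of the processor becomes the limit.
    return min(issue_width, num_threads * per_thread_ipc)


for threads in (1, 2, 4, 8, 16):
    print(threads, aggregate_ipc(threads, load_latency_cycles=8))
```

With enough threads, the IPC in this model is set by the issue width and no longer by the load latency.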
Some prefetch techniques
• Stride prefetching:
search for pattern with constant stride
e.g. addresses 20, 31, 42, 53, 64 (stride: 11)
e.g. walking through a matrix (row- or column-order)
• Markov prefetching:
recurring patterns of misses
[Figure: a miss history (10 110 15 12 …) from which a prediction (100 ...) is derived]
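Both techniques can be sketched in a few lines. The code below is a minimal illustration, not the mechanism evaluated in the talk: the stride detector uses only the last three addresses, the Markov predictor is a simple first-order table (most frequent successor of a miss address), and all names and the second miss sequence are hypothetical.

```python
from collections import Counter, defaultdict


def stride_predict(addresses):
    """Predict the next address if the last three show a constant stride."""
    if len(addresses) < 3:
        return None
    stride = addresses[-1] - addresses[-2]
    if stride != 0 and stride == addresses[-2] - addresses[-3]:
        return addresses[-1] + stride
    return None


class MarkovPredictor:
    """First-order Markov table: remembers which miss followed which."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address

    def predict(self, address):
        counts = self.successors.get(address)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


print(stride_predict([20, 31, 42, 53, 64]))   # stride 11 -> 75

markov = MarkovPredictor()
for miss in [10, 110, 15, 12, 10, 110, 15]:   # hypothetical recurring misses
    markov.record_miss(miss)
print(markov.predict(10))                     # 110 followed 10 before -> 110
```

The stride detector needs regular address patterns, whereas the Markov table can also capture irregular but recurring miss sequences.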
Stride prefetching
[Figure: IPC (axis 4.9 to 5.2) versus latency main memory (70 to 90 ns), with curves for prefetching and no prefetching]
IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress
Prefetching and sensitivity
Factors by which “performance sensitivity to latency” increases with stride-prefetching:
Columns: L1-cache/L2-cache, L2-cache/main memory
to L1-prefetching: 1.6, 4.1
to L2-prefetching: 2.5
Latency is important:
generalization to other processor architectures
Consider schedule of program:
[Figure: program schedule along a time axis]
Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency important whatever processor architecture
Optical interconnects (OI)
• Mature components:
– Vertical-Cavity Surface Emitting Lasers
(VCSELs)
– Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in
telecom and networks
• Useful for short inter-chip and even
intra-chip interconnects?
OI in processor context
• At levels close to processor core,
latency is very important
=> latency of OI determines how far OI
penetrates in the memory hierarchy
• What is the latency of an optical
interconnect?
An optical link
[Figure: optical link: buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode, transimpedance amplifier]
Total latency = buffer latency + VCSEL/LED latency
+ time of flight + receiver latency
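The sum can be written out as a small calculation. In this sketch only the time of flight is computed from physics, assuming a light conductor with a refractive index of about 1.5 (typical for glass); the buffer, emitter and receiver latencies passed in the example are hypothetical placeholders, not measured values from the talk.

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Propagation delay through the fiber or light conductor, in ns."""
    # Light travels at c / n in a medium with refractive index n (assumed 1.5).
    return length_m * refractive_index / C_VACUUM_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns, emitter_ns, length_m, receiver_ns,
                          refractive_index=1.5):
    """Total latency = buffer + VCSEL/LED + time of flight + receiver."""
    return (buffer_ns + emitter_ns
            + time_of_flight_ns(length_m, refractive_index) + receiver_ns)


print(round(time_of_flight_ns(0.10), 2))   # 10 cm link -> about 0.5 ns

# Placeholder component latencies (ns), for illustration only.
print(round(total_link_latency_ns(buffer_ns=0.5, emitter_ns=1.0,
                                  length_m=0.10, receiver_ns=1.0), 2))
```

For a 10 cm link the time of flight is about 0.5 ns, so the electronic and electro-optical components can easily dominate the total.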
VCSEL characteristics
• A small semiconductor laser
• Carrier density should be high enough for lasing action
[Figure: optical output (mW, axis 0 to 2) versus current (mA, axis 0 to 3), with curves for optical power and carrier density]
Total VCSEL link latency
consists of
• Buffer latency
• Parasitic capacitances and series
resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical
output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency
Total optical link latency
[Figure: stacked bars of latency (ns, axis 0 to 7) @ 1 mW for LED (0.6 µm CMOS), LED (0.25 µm CMOS), VCSEL (0.6 µm CMOS) and VCSEL (0.25 µm CMOS); components: buffer, parasitics, threshold, intrinsic, receiver, TOF (10 cm)]
Latency as function of power
[Figure: latency (ns, axis 0 to 8) versus optical output power (mW, axis 0 to 6), with curves for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]
Conclusions
• When combining performance sensitivity and optical latency we conclude:
– optical interconnects are feasible to main memory and for multiprocessors
– for interconnects close to the processor core, optical interconnects have too high a latency with present (telecom) devices, drivers and receivers
=> but an evolution toward lower-latency devices, drivers and receivers is now taking place...
For more information on the presented results:
Henk Neefs, Latentiebeheersing in processors, PhD Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs